Switching system

ABSTRACT

A system and method for providing a switch system (100) having a first configurable set of processor elements (102) to process storage resource connection requests (104), a second configurable set of processor elements capable of communications with the first configurable set of processor elements (102) to receive, from the first configurable set of processor elements, storage connection requests representative of client requests, and to route the requests to at least one of the storage elements (104), and a configurable switching fabric (106) interconnected between the first and second sets of processor elements (102), for receiving at least a first storage connection request (104) from one of the first set of processor elements (102), determining an appropriate one of the second set of processors for processing the storage connection request (104), automatically configuring the storage connection request in accordance with a protocol utilized by the selected one of the second set of processors, and forwarding the storage connection request to the selected one of the second set of processors for routing to at least one of the storage elements.

INCORPORATION BY REFERENCE/PRIORITY CLAIM

Commonly owned U.S. provisional application for patent Ser. No. 60/245,295, filed Nov. 2, 2000, incorporated by reference herein; and

Commonly owned U.S. provisional application for patent Ser. No. 60/301,378, filed Jun. 27, 2001, incorporated by reference herein.

Additional publications are incorporated by reference herein as set forth below.

FIELD OF THE INVENTION

The present invention relates to digital information processing, and particularly to methods, systems and protocols for managing storage in digital networks.

BACKGROUND OF THE INVENTION

The rapid growth of the Internet and other networked systems has accelerated the need for processing, transferring and managing data in and across networks.

In order to meet these demands, enterprise storage architectures have been developed, which typically provide access to a physical storage pool through multiple independent SCSI channels interconnected with storage via multiple front-end and back-end processors/controllers. Moreover, in data networks based on IP/Ethernet technology, standards have been developed to facilitate network management. These standards include Ethernet, Internet Protocol (IP), Internet Control Message Protocol (ICMP), Management Information Base (MIB) and Simple Network Management Protocol (SNMP). Network Management Systems (NMSs) such as HP OpenView utilize these standards to discover and monitor network devices. Examples of networked architectures are disclosed in the following patent documents, the disclosures of which are incorporated herein by reference:

U.S. Pat. No. 5,941,972, Crossroads Systems, Inc.; U.S. Pat. No. 6,000,020, Gadzoox Network, Inc.; U.S. Pat. No. 6,041,381, Crossroads Systems, Inc.; U.S. Pat. No. 6,061,358, McData Corporation; U.S. Pat. No. 6,067,545, Hewlett-Packard Company; U.S. Pat. No. 6,118,776, Vixel Corporation; U.S. Pat. No. 6,128,656, Cisco Technology, Inc.; U.S. Pat. No. 6,138,161, Crossroads Systems, Inc.; U.S. Pat. No. 6,148,421, Crossroads Systems, Inc.; U.S. Pat. No. 6,151,331, Crossroads Systems, Inc.; U.S. Pat. No. 6,199,112, Crossroads Systems, Inc.; U.S. Pat. No. 6,205,141, Crossroads Systems, Inc.; U.S. Pat. No. 6,247,060, Alacritech, Inc.; WO 01/59966, Nishan Systems, Inc.

Conventional systems, however, do not enable seamless connection and interoperability among disparate storage platforms and protocols. Storage Area Networks (SANs) typically use a completely different set of technology based on Fibre Channel (FC) to build and manage storage networks. This has led to a “re-inventing of the wheel” in many cases. Users are often required to deal with multiple suppliers of routers, switches, host bus adapters and other components, some of which are not well-adapted to communicate with one another. Vendors and standards bodies continue to determine the protocols to be used to interface devices in SAN and NAS configurations; and SAN devices do not integrate well with existing IP-based management systems.

Still further, the storage devices (Disks, RAID Arrays, and the like), which are Fibre Channel attached to the SAN devices, typically do not support IP (and the SAN devices have limited IP support), and the storage devices cannot be discovered/managed by IP-based management systems. There are essentially two sets of management products—one for the IP devices and one for the storage devices.

Accordingly, it is desirable to enable servers, storage and network-attached storage (NAS) devices, IP and Fibre Channel switches on storage-area networks (SAN), WANs or LANs to interoperate to provide improved storage data transmission across enterprise networks.

In addition, among the most widely used protocols for communications within and among networks, TCP/IP (Transmission Control Protocol/Internet Protocol) is the suite of communications protocols used to connect hosts on the Internet. TCP provides reliable, virtual circuit, end-to-end connections for transporting data packets between nodes in a network. Implementation examples are set forth in the following patent and other publications, the disclosures of which are incorporated herein by reference:

U.S. Pat. No. 5,260,942, IBM; U.S. Pat. No. 5,442,637, ATT; U.S. Pat. No. 5,566,170, Storage Technology Corporation; U.S. Pat. No. 5,598,410, Storage Technology Corporation; U.S. Pat. No. 6,006,259, Network Alchemy, Inc.; U.S. Pat. No. 6,018,530, Sham Chakravorty; U.S. Pat. No. 6,122,670, TSI Telsys, Inc.; U.S. Pat. No. 6,163,812, IBM; U.S. Pat. No. 6,178,448, IBM; “TCP/IP Illustrated, Volume 2”, Wright, Stevens; “SCSI over TCP”, IETF draft, IBM, Cisco, Sangate, February 2000; “The SCSI Encapsulation Protocol (SEP)”, IETF draft, Adaptec Inc., May 2000; RFC 793, “Transmission Control Protocol”, September 1981.

Although TCP is useful, it requires substantial processing by the system CPU, thus limiting throughput and system performance. Designers have attempted to avoid this limitation through various inter-processor communications techniques, some of which are described in the above-cited publications. For example, some have offloaded TCP processing tasks to an auxiliary CPU, which can reside on an intelligent network interface or similar device, thereby reducing load on the system CPU. However, this approach does not eliminate the problem, but merely moves it elsewhere in the system, where it remains a single chokepoint of performance limitation.

Others have identified separable components of TCP processing and implemented them in specialized hardware. These can include calculation or verification of TCP checksums over the data being transmitted, and the appending or removing of fixed protocol headers to or from such data. These approaches are relatively simple to implement in hardware to the extent they perform only simple, condition-invariant manipulations, and do not themselves cause a change to be applied to any persistent TCP state variables. However, while these approaches somewhat reduce system CPU load, they have not been observed to provide substantial performance gains.

Some required components of TCP, such as retransmission of a TCP segment following a timeout, are difficult to implement in hardware because of their complex and condition-dependent behavior. For this reason, systems designed to perform substantial TCP processing in hardware often include a dedicated CPU capable of handling these exception conditions. Alternatively, such systems may decline to handle TCP segment retransmission or other complex events and instead defer their processing to the system CPU.

However, a major difficulty in implementing such “fast path/slow path” solutions is ensuring that the internal state of the TCP connections, which can be modified as a result of performing these operations, is consistently maintained, whether the operations are performed by the “fast path” hardware or by the “slow path” system CPU.

It is therefore desirable to provide methods, devices and systems that simplify and improve these operations.

It is also desirable to provide methods, devices and systems that simplify management of storage in digital networks, and enable flexible deployment of NAS, SAN and other storage systems, and Fibre Channel (FC), IP/Ethernet and other protocols, with storage subsystem and location independence.

SUMMARY OF THE INVENTION

The invention addresses the noted problems typical of prior art systems, and in one aspect, provides a switch system having a first configurable set of processor elements to process storage resource connection requests, a second configurable set of processor elements capable of communications with the first configurable set of processor elements to receive, from the first configurable set of processor elements, storage connection requests representative of client requests, and to route the requests to at least one of the storage elements, and a configurable switching fabric interconnected between the first and second sets of processor elements, for receiving at least a first storage connection request from one of the first set of processor elements, determining an appropriate one of the second set of processors for processing the storage connection request, automatically configuring the storage connection request in accordance with a protocol utilized by the selected one of the second set of processors, and forwarding the storage connection request to the selected one of the second set of processors for routing to at least one of the storage elements.

Another aspect of the invention provides methods, systems and devices for enabling data replication under NFS servers.

A further aspect of the invention provides mirroring of NFS servers using a multicast function.

Yet another aspect of the invention provides dynamic content replication under NFS servers.

In another aspect, the invention provides load balanced NAS using a hashing or similar function, and dynamic data grooming and NFS load balancing across NFS servers.

The invention also provides, in a further aspect, domain sharing across multiple FC switches, and secure virtual storage domains (SVSD).

Still another aspect of the invention provides TCP/UDP acceleration, with IP stack bypass using a network processor (NP). The present invention simultaneously maintains TCP state information in both the fast path and the slow path. Control messages are exchanged between the fast path and slow path processing engines to maintain state synchronization, and to hand off control from one processing engine to another. These control messages can be optimized to require minimal processing in the slow path engines (e.g., system CPU) while enabling efficient implementation in the fast path hardware. This distributed synchronization approach not only significantly accelerates TCP processing, but also provides additional benefits, in that it permits the creation of more robust systems.
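By way of illustration only, the following Python sketch shows one way such control-message-based state synchronization between a fast-path and a slow-path engine could be organized. The class and field names (TcpState, StateSync, handoff, apply) and the contents of the control message are assumptions for this example, not details taken from the specification.

```python
# Illustrative sketch of fast-path/slow-path TCP state synchronization:
# whichever engine owns a connection sends a small control message that
# transfers ownership together with a snapshot of the connection state.
from dataclasses import dataclass

@dataclass
class TcpState:
    snd_nxt: int      # next sequence number to send
    rcv_nxt: int      # next sequence number expected
    owner: str        # "fast" or "slow": which engine currently owns this state

class StateSync:
    def __init__(self):
        # key: (src_ip, src_port, dst_ip, dst_port) -> TcpState
        self.connections = {}

    def handoff(self, conn_key, new_owner):
        """Build a control message handing the connection to the other engine."""
        state = self.connections[conn_key]
        msg = {"conn": conn_key, "snd_nxt": state.snd_nxt,
               "rcv_nxt": state.rcv_nxt, "new_owner": new_owner}
        state.owner = new_owner
        return msg

    def apply(self, msg):
        """Receiving engine installs the snapshot so both paths stay consistent."""
        self.connections[msg["conn"]] = TcpState(
            msg["snd_nxt"], msg["rcv_nxt"], msg["new_owner"])
```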

The invention, in another aspect, also enables automatic discovery of SCSI devices over an IP network, and mapping of SNMP requests to SCSI.

In addition, the invention also provides WAN mediation caching on local devices.

Each of these aspects will next be described in detail, with reference to the attached drawing figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a hardware architecture of one embodiment of the switch system aspect of the invention.

FIG. 2 depicts interconnect architecture useful in the embodiment of FIG. 1.

FIG. 3 depicts processing and switching modules.

FIG. 4 depicts software architecture in accordance with one embodiment of the invention.

FIG. 5 depicts detail of the client abstraction layer.

FIG. 6 depicts the storage abstraction layer.

FIG. 7 depicts scaleable NAS.

FIG. 8 depicts replicated local/remote storage.

FIG. 9 depicts a software structure useful in one embodiment of the invention.

FIG. 9a depicts the MIC and MLAN components of FIG. 9.

FIG. 9b depicts the MLAN, LIC and SRC-NAS and fabric components of FIG. 9.

FIG. 9c depicts the SRC-Mediator component and fabric of FIG. 9.

FIG. 10 depicts system services.

FIG. 11 depicts a management software overview.

FIG. 12 depicts a virtual storage domain.

FIG. 13 depicts another virtual storage domain.

FIG. 14 depicts configuration processing boot-up sequence.

FIG. 15 depicts a further virtual storage domain example.

FIG. 16 is a flow chart of NFS mirroring and related functions.

FIG. 17 depicts interface module software.

FIG. 18 depicts a flow control example.

FIG. 19 depicts hardware in an SRC.

FIG. 20 depicts SRC NAS software modules.

FIG. 21 depicts SCSI/UDP operation.

FIG. 22 depicts SRC software storage components.

FIG. 23 depicts FC originator/FC target operation.

FIG. 24 depicts load balancing NFS client requests between NFS servers.

FIG. 25 depicts NFS receive micro-code flow.

FIG. 26 depicts NFS transmit micro-code flow.

FIG. 27 depicts file handle entry into multiple server lists.

FIG. 28 depicts a sample network configuration in another embodiment of the invention.

FIG. 29 depicts an example of a virtual domain configuration.

FIG. 30 depicts an example of a VLAN configuration.

FIG. 31 depicts a mega-proxy example.

FIG. 32 depicts device discovery in accordance with another aspect of the invention.

FIG. 33 depicts SNMP/SCSI mapping.

FIG. 34 depicts SCSI response/SNMP trap mapping.

FIG. 35 depicts data structures useful in another aspect of the invention.

FIG. 36 depicts mirroring and load balancing operation.

FIG. 37 depicts server classes.

FIGS. 38A, 38B, 38C depict mediation configurations in accordance with another aspect of the invention.

FIG. 39 depicts operation of mediation protocol engines.

FIG. 40 depicts configuration of storage by the volume manager in accordance with another aspect of the invention.

FIG. 41 depicts data structures for keeping track of virtual devices and sessions.

FIG. 42 depicts mediation manager operation in accordance with another aspect of the invention.

FIG. 43 depicts mediation in accordance with one practice of the invention.

FIG. 44 depicts mediation in accordance with another practice of the invention.

FIG. 45 depicts fast-path architecture in accordance with the invention.

FIG. 46 depicts IXP packet receive processing for mediation.

DETAILED DESCRIPTION OF THE INVENTION

I. Overview

FIG. 1 depicts the hardware architecture of one embodiment of a switch system according to the invention. As shown therein, the switch system 100 is operable to interconnect clients and storage. As discussed in detail below, storage processor elements 104 (SPs) connect to storage; IP processor elements 102 (IP) connect to clients or other devices; and a high speed switch fabric 106 interconnects the IP and SP elements, under the control of control elements 103.

The IP processors provide content-aware switching, load balancing, mediation, TCP/UDP hardware acceleration, and fast forwarding, all as discussed in greater detail below. In one embodiment, the high speed fabric comprises redundant control processors and a redundant switching fabric, provides scalable port density and is media-independent. As described below, the switch fabric enables media-independent module interconnection, and supports low-latency Fibre Channel (F/C) switching. In an embodiment of the invention commercially available from the assignee of this application, the fabric maintains QoS for Ethernet traffic, is scalable from 16 to 256 Gbps, and can be provisioned as a fully redundant switching fabric with fully redundant control processors, ready for 10 Gb Ethernet, InfiniBand and the like. The SPs support NAS (NFS/CIFS), mediation, volume management, Fibre Channel (F/C) switching, SCSI and RAID services.

FIG. 2 depicts an interconnect architecture adapted for use in the switching system 100 of FIG. 1. As shown therein, the architecture includes multiple processors interconnected by dual paths 110, 120. Path 110 is a management and control path adapted for operation in accordance with switched Ethernet. Path 120 is a high speed switching fabric, supporting a point to point serial interconnect. Also as shown in FIG. 2, front-end processors include SFCs 130, LAN Resource Cards (LRCs) 132, and Storage Resource Cards (SRCs) 134, which collectively provide processing power for the functions described below. Rear-end processors include MICs 136, LIOs 138 and SIOs 140, which collectively provide wiring and control for the functions described below.

In particular, the LRCs provide interfaces to external LANs, servers, WANs and the like (such as by 4×Gigabit Ethernet or 32×10/100 Base-T Ethernet interface adapters); perform load balancing and content-aware switching of internal services; implement storage mediation protocols; and provide TCP hardware acceleration.

The SRCs interface to external storage or other devices (such as via Fibre Channel, 1 or 2 Gbps, FC-AL or FC-N).

As shown in FIG. 3, LRCs and LIOs are network processors providing LAN-related functions. They can include GBICs and RJ45 processors. MICs provide control and management. As discussed below, the switching system utilizes redundant MICs and redundant fabrics. The FIOs shown in FIG. 3 provide F/C switching. These modules can be commercially available ASIC-based F/C switch elements, and collectively enable low cost, high-speed SAN using the methods described below.

FIG. 4 depicts a software architecture adapted for use in an embodiment of switching system 100, wherein a management layer 402 interconnects with client services 404, mediation services 406, storage services 408, a client abstraction layer 410, and a storage abstraction layer 412. In turn, the client abstraction layer interconnects with client interfaces (LAN, SAN or other) 414, and the storage abstraction layer interconnects with storage devices or storage interfaces (LAN, SAN or other) 416.

The client abstraction layer isolates, secures, and protects internal resources; enforces external group isolation and user authentication; provides firewall access security; supports redundant network access with fault failover; and integrates IP routing and multiport LAN switching. In addition, it presents external clients with a “virtual service” abstraction of internal services, so that there is no need to reconfigure clients when services are changed. Further, it provides internal services with a consistent network interface, wherein service configuration is independent of network connectivity, and there is no impact from VLAN topology, multihoming or peering.

FIG. 5 provides detail of the client abstraction layer. As shown therein, it can include a TCP acceleration function 502 (which, among other activities, offloads processing of reliable data streams); a load balancing function 504 (which distributes requests among equivalent resources); content-aware switching 506 (which directs requests to an appropriate resource based on the contents of the requests/packets); a virtualization function 508 (which provides isolation and increased security); an 802.1 switching and IP routing function 510 (which supports link/path redundancy); and physical I/F support functions 512 (which can support 10/100Base-T, Gigabit Ethernet, Fibre Channel and the like).

In addition, an internal services layer provides protocol mediation, and supports NAS and switching and routing. In particular, in iSCSI applications the internal services layer uses TCP/IP or the like to provide LAN-attached servers with access to block-oriented storage; in FC/IP it interconnects Fibre Channel SAN “islands” across an Internet backbone; and in IP/FC applications it extends IP connectivity across Fibre Channel. Among NAS functions, the internal services layer includes support for NFS (industry-standard Network File Service), provided over UDP/IP (LAN) or TCP/IP (WAN), and CIFS (compatible with Microsoft Windows File Services, also known as SMB). Among switching and routing functions, the internal services layer supports Ethernet, Fibre Channel and the like.

The storage abstraction layer shown in FIG. 6 includes file system 602, volume management 604, RAID function 606, storage access processing 608, transport processing 610 and physical I/F support 612. File system layer 602 supports multiple file systems; the volume management layer creates and manages logical storage partitions; the RAID layer enables optional data replication; the storage access processing layer supports SCSI or similar protocols; and the transport layer is adapted for Fibre Channel or SCSI support. The storage abstraction layer consolidates external disk drives, storage arrays and the like into a sharable, pooled resource; and provides volume management that allows dynamically resizeable storage partitions to be created within the pool, RAID service that enables volume replication for data redundancy and improved performance, and file service that allows creation of distributed, sharable file systems on any storage partition.

A technical advantage of this configuration is that a single storage system can be used for both file and block storage access (NAS and SAN).

FIGS. 7 and 8 depict examples of data flows through the switching system 100. (It will be noted that these configurations are provided solely by way of example, and that other configurations are possible.) In particular, as will be discussed in greater detail below, FIG. 7 depicts a scaleable NAS example, while FIG. 8 depicts a replicated local/remote storage example. As shown in FIG. 7, the switch system 100 includes a secure virtual storage domain (SVSD) management layer 702, NFS servers collectively referred to by numeral 704, and modules 706 and 708.

Gigabit module 706 contains TCP 710, load balancing 712, content-aware switching 714, virtualization 716, 802.1 switching and IP routing 718, and Gigabit (GB) optics collectively referred to by numeral 720.

FC module 708 contains file system 722, volume management 724, RAID 726, SCSI 728, Fibre Channel 730, and FC optics collectively referred to by numeral 731.

As shown in the scaleable NAS example of FIG. 7, the switch system 100 connects clients on multiple Gigabit Ethernet LANs 732 (or similar) to (1) unique content on separate storage 734 and (2) replicated filesystems for commonly accessed files 736. The data pathways depicted run from the clients, through the GB optics, 802.1 switching and IP routing, virtualization, content-aware switching, load balancing and TCP, into the NFS servers (under the control/configuration of SVSD management), and into the file system, volume management, RAID, SCSI, Fibre Channel, and FC optics to the unique content (which bypasses RAID) and the replicated filesystems (which flow through RAID).

Similar structures are shown in the replicated local/remote storage example of FIG. 8. However, in this case, the interconnection is between clients on a Gigabit Ethernet LAN (or similar) 832, secondary storage at an offsite location via a TCP/IP network 834, and locally attached primary storage 836. In this instance, the flow is from the clients, through the GB optics, 802.1 switching and IP routing, virtualization, content-aware switching, load balancing and TCP, then through iSCSI mediation services 804 (under the control/configuration of SVSD management 802), then through volume management 824 and RAID 826. Then, one flow is from RAID 826 through SCSI 828, Fibre Channel 830 and FC optics 831 to the locally attached storage 836; while another flow is from RAID 826 back through TCP 810, load balancing 812, content-aware switching 814, virtualization 816, 802.1 switching and IP routing 818 and GB optics 820 to the secondary storage at the offsite location via the TCP/IP network 834.

II. Hardware/Software Architecture

This section provides an overview of the structure and function of the invention (alternatively referred to hereinafter as the “Pirus box”). In one embodiment, the Pirus box is a 6-slot, carrier-class, high performance, multi-layer switch, architected to be the core of the data storage infrastructure. The Pirus box will be useful for ASPs (Application Storage Providers), SSPs (Storage Service Providers) and large enterprise networks. One embodiment of the Pirus box will support Network Attached Storage (NAS) in the form of NFS attached disks off of Fibre Channel ports. These attached disks are accessible via 10/100/1000 switched Ethernet ports. The Pirus box will also support standard layer 2 and layer 3 switching with port-based VLAN support, and layer 3 routing (on unlearned addresses). RIP will be one routing protocol supported, with OSPF and others also to be supported. The Pirus box will also initiate and terminate a wide range of SCSI mediation protocols, allowing access to the storage media either via Ethernet or SCSI/FC. The box is manageable via a CLI, SNMP or an HTTP interface.

1 Software Architecture Overview

FIG. 9 is a block diagram illustrating the software modules used in the Pirus box (the terms of which are defined in the glossary set forth below). As shown in FIG. 9, the software structures correspond to MIC 902, LIC 904, SRC-NAS 908 and SRC-Mediator 910, interconnected by MLAN 905 and fabric 906. The operation of each of the components shown in the drawing is discussed below.

1.1 System Services

The term System Service is used herein to denote a significant function that is provided on every processor in every slot. It is contemplated that many such services will be provided, and that they can be segmented into 2 categories: 1) abstracted hardware services and 2) client/server services. The attached FIG. 10 is a diagram of some of the exemplary interfaces. As shown in FIG. 10, the system services correspond to IPCs 1002 and 1004 associated with fabric and control channel 1006, and with services SCSI 1008, RSS 1010, NPCS 1012, AM 1014, Log/Event 1016, Cache/Bypass 1018, TCP/IP 1020, and SM 1022.

1.1.1 SanStreaM (SSM) System Services (S2)

An SSM system service can be defined as a service that provides a software API layer to application software while “hiding” the underlying hardware control. These services may add value to the process by adding protocol layering or robustness to the standard hardware functionality.

System services that are provided include:

Card Processor Control Manager (CPCM). This service provides a mechanism to detect and manage the issues involved in controlling a Network Engine Card (NEC) and its associated Network Processors (NP). They include insertion and removal, temperature control, crash management, loader, watchdog, failures, etc.

Local Hardware Control (LHC). This controls the hardware local to the board itself. It includes LEDs, fans, and power.

Inter-Processor Communication (IPC). This includes control bus and fabric services, and remote UART.

1.1.2 SSM Application Service (AS)

Application services provide an API on top of SSM system services. They are useful for executing functionality remotely.

Application Services include:

Remote Shell Service (RSS)—includes redirection of debug and other valuable info to any pipe in the system.

Statistics Provider—providers register with the stats consumer to provide the needed information, such as MIB read-only attributes.

Network Processor Config Service (NPCS)—used to receive and process configuration requests.

Action Manager—used to send and receive requests to execute remote functionality such as rebooting, clearing stats and re-syncing with a file system.

Logging Service—used to send and receive event logging information.

Buffer Management—used as a fast and useful mechanism for allocating, typing, chaining and freeing message buffers in the system.

HTTP Caching/Bypass service—a sub-system to supply an API and functional service for HTTP file caching and bypass. It will make the determination to cache a file, retrieve a cached file (on board or off), and bypass a file (on board or not). In addition, this service will keep track of local cached files and their associated TTL, as well as statistics on file bypassing. It will also keep a database of known files and their caching and bypassing status.

Multicast services—a service to register, send and receive multicast packets across the MLAN.

2. Management Interface Card

The Management Interface Card (MIC) of the Pirus box has a single high performance microprocessor and multiple 10/100 Ethernet interfaces for administration of the SANStream management subsystem. This card also has a PCMCIA device for bootstrap image and configuration storage.

In the illustrated embodiments, the Management Interface Card will not participate in any routing protocol or forwarding path decisions. The IP stack and services of VxWorks will be used as the underlying IP facilities for all processes on the MIC. The MIC card will also have a flash-based DOS file system.

The MIC will not be connected to the backplane fabric, but will be connected to the MLAN (Management LAN) in order to send/receive data to/from the other cards in the system. The MLAN is used for all communications between the MIC and the other cards.

2.1. Management Software

Management software is a collection of components responsible for configuration, reporting (status, statistics, etc.), notification (events) and billing data (accounting information). The management software may also include components that implement services needed by the other modules in the system.

Some of the management software components can exist on any processor in the system, such as the logging server. Other components reside only on the MIC, such as the WEB Server providing the WEB user interface.

The strategy and subsequent architecture must be flexible enough to provide a long-term solution for the product family. In other words, the 1.0 implementation must not preclude the inclusion of additional management features in subsequent releases of the product.

The management software components that can run on either the MIC or NEC need to meet the requirement of being able to “run anywhere” in the system.

2.2 Management Software Overview

In the illustrated embodiments the management software decomposes into the following high-level functions, shown in FIG. 11. As shown in the example of FIG. 11 (other configurations are also possible and within the scope of the invention), management software can be organized into User Interfaces (UIs) 1102, rapid control backplane (RCB) data dictionary 1104, system abstraction model (SAM) 1106, configuration & statistics manager (CSM) 1108, and logging/billing APIs 1110, on module 1101. This module can communicate across system services (S2) 1112 and hardware elements 1114 with the configuration & statistics agent (CSA) 1116 and applications 1118.

The major components of the management software include the following:

2.2.1 User Interfaces (UIs)

These components are the user interfaces that allow the user access to the system via a CLI, HTTP Client or SNMP Agent.

2.2.2 Rapid Control Backplane (RCB)

These components make up the database, or data dictionary, of settable/gettable objects in the system. The UIs use “Rapid Marks” (keys) to reference the data contained within the database. The actual location of the data specified by a Rapid Mark may be on or off the MIC.

2.2.3 System Abstraction Model (SAM)

These components provide a software abstraction of the physical components in the system. The SAM works in conjunction with the RCB to get/set data for the UIs. The SAM determines where the data resides and, if necessary, interacts with the CSM to get/set the data.

2.2.4 Configuration & Statistics Manager (CSM)

These components are responsible for communicating with the other cards in the system to get/set data. For example, the CSM sends configuration data to a card/processor when a UI initiates a change, and receives statistics from a card/processor when a UI requests some data.

2.2.5 Logging/Billing APIs

These components interface with the logging and event servers provided by System Services and are responsible for sending logging/billing data to the desired location and generating SNMP traps/alerts when needed.

2.2.6 Configuration & Statistics Agent (CSA)

These components interface with the CSM on the MIC and respond to CSM messages for configuration/statistics data.

2.3 Dynamic Configuration

The SANStream management system will support dynamic configuration updates. A significant advantage is that it will be unnecessary to reboot the entire chassis when an NP's configuration is modified. The bootstrap configuration can follow similar dynamic guidelines. Bootstrap configuration is merely dynamic configuration of an NP that is in the reset state.

Both soft and hard configuration will be supported. Soft configuration allows dynamic modification of current system settings.

Hard configuration modifies bootstrap or start-up parameters. A hard configuration is accomplished by saving a soft configuration. A hard configuration change can also be made by (T)FTP of a configuration file. The MIC will not support local editing of configuration files.

In a preferred practice of the invention, DNS services will be available and utilized by MIC management processes to resolve hostnames into IP addresses.

2.4 Management Applications

In addition to providing “rote” management of the system, the management software will provide additional management applications/functions. The level of integration with the WEB UI for these applications can be left to the implementer. For example, the Zoning Manager could either be folded into the HTML pages served by the embedded HTTP server, or the HTTP server could serve up a stand-alone JAVA Applet.

2.4.1 Volume Manager

A preferred practice of the invention will provide a volume manager function. Such a Volume Manager may support:

-   RAID 0—Striping
-   RAID 1—Mirroring
-   Hot Spares
-   Aggregating several disks into a large volume.
-   Partitioning a large disk into several smaller volumes.

2.4.2 Load Balancer

This application configures the load balancing functionality. This involves configuring policies to guide traffic through the system to its ultimate destination. This application will also report status and usage statistics for the configured policies.

2.4.3 Server-less Backup (NDMP)

This application will support NDMP and allow for serverless backup. This will allow users the ability to back up disk devices to tape devices without a server intervening.

2.4.4 IP-ized Storage Management

This application will “hide” storage and FC parameters from IP-centric administrators. For example, storage devices attached to FC ports will appear as IP devices in an HP OpenView network map. These devices will be “ping-able” and “discoverable”, and will support a limited scope of MIB variables.

In order to accomplish this, IP addresses must be assigned to the storage devices (either manually or automatically) and the MIC will have to be sent all IP Mgmt (exact list TBD) packets destined for one of the storage IP addresses. The MIC will then mediate by converting the IP packet (request) to a similar FC/SCSI request and sending it to the device.

For example, an IP Ping would become a SCSI Inquiry, while an SNMP get of sysDescription would also be a SCSI Inquiry, with some of the returned data (from the Inquiry) mapped into the MIB variable and returned to the requestor. These features are discussed in greater detail in the IP Storage Management section below.
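As a hedged illustration of this mapping idea (not the product's actual interface), the sketch below translates an ICMP echo or an SNMP get of sysDescr aimed at a storage device's assigned IP address into a SCSI INQUIRY toward the FC-attached device. The helper send_scsi_inquiry(), the handle_ip_request() function, and the canned INQUIRY payload are all illustrative assumptions; only the INQUIRY opcode and the vendor/product byte offsets come from the SCSI standard.

```python
# Sketch: mediate IP management requests into SCSI commands and back.
SCSI_INQUIRY = 0x12  # standard SCSI INQUIRY opcode

def send_scsi_inquiry(fc_target):
    """Placeholder for the FC/SCSI transport; a real implementation would
    issue the INQUIRY to the Fibre Channel target and return its data."""
    return bytes(8) + b"VENDOR  PRODUCT-ID      " + bytes(12)

def handle_ip_request(kind, fc_target):
    """Map an IP management request onto a SCSI INQUIRY."""
    if kind == "icmp_echo":
        # "Ping-able": the device is considered alive if the INQUIRY completes.
        send_scsi_inquiry(fc_target)
        return {"reachable": True}
    if kind == "snmp_get_sysDescr":
        data = send_scsi_inquiry(fc_target)
        # Vendor (bytes 8-15) and product (bytes 16-31) identification are
        # mapped into the returned MIB variable.
        return {"sysDescr": data[8:32].decode(errors="replace").strip()}
    raise ValueError("unsupported request type")
```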

2.4.5 Mediation Manager

This application is responsible for configuring, monitoring and managing the mediation between storage and networking protocols. This includes session configurations, terminations, usage reports, etc. These features are discussed in greater detail in the Mediation Manager section below.

2.4.6 VLAN Manager

Port-level VLANs will be supported. Ports can belong to more than one VLAN.

The VLAN Manager and Zoning Manager could be combined into a VDM (or some other name) Manager as a way of unifying the Ethernet and FC worlds.

2.4.7 File System Manager

The majority of file system management will probably be to “accept the defaults”. There may be an exception if it is necessary to format disks when they are attached to a Pirus system or perform other disk operations.

2.5 Virtual Storage Domain (VSD)

Virtual storage domains serve 2 purposes.

-   1. Logically group together a collection of resources.
-   2. Logically group together and “hide” a collection of resources from the outside world.

The two cases are very similar. The second case is used when we are load balancing among NAS servers. FIG. 12 illustrates the first example:

In this example, Server 1 1226 is using SCSI/IP to communicate with Disks A and B at a remote site, while Server 2 1224 is using SCSI/IP to communicate with Disks C and D 1208 at the same remote site. For this configuration, Disks A, B, C, and D must have valid IP addresses. Logically, inside the PIRUS system 2 Virtual Domains are created, one for Disks A and B and one for Disks C and D. The IFF software doesn't need to know about the VSDs; since the IP addresses for the disks are valid (exportable), it can simply forward the traffic to the correct destination. The VSD is configured for the management of the resources (disks).

The second usage of virtual domains is more interesting. In this case, let's assume we want to load balance among 3 NAS servers. A VSD would be created and a Virtual IP Address (VIP) assigned to it. External entities would use this VIP to address the NAS, and internally the PIRUS system would use NAT and policies to route the request to the correct NAS server. FIG. 13 illustrates this.

In this example, users of the NAS service would simply reference the VIP for Joe's ASP NAS LB service. Internally, through the combination of virtual storage domains and policies, the Pirus system load balances the request among 3 internal NAS servers 1306, 1308, 1310, thus providing a scalable, redundant NAS solution.
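A minimal sketch of this second usage follows: clients address the VIP, and the box rewrites the destination to one of the internal NAS servers chosen by a balancing policy. The class name VirtualStorageDomain, the trivial round-robin choice, and the packet representation are assumptions for illustration; the addresses mirror the NAT example given later in this section.

```python
# Sketch: VIP-addressed traffic is NATed to an internal NAS server.
import itertools

class VirtualStorageDomain:
    def __init__(self, vip, servers):
        self.vip = vip
        self._next = itertools.cycle(servers)   # trivial round-robin policy

    def nat(self, packet):
        """Rewrite the destination of a packet addressed to the VIP."""
        if packet["dst"] == self.vip:
            packet["dst"] = next(self._next)
        return packet

vsd = VirtualStorageDomain("100.100.100.100", ["1.1.1.1", "1.1.1.2", "1.1.1.3"])
print(vsd.nat({"src": "9.9.9.9", "dst": "100.100.100.100"}))  # dst becomes 1.1.1.1
```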

Virtual Domains can be used to virtualize the entire Pirus system.

Within VSDs the following entities are noteworthy:

2.5.1 Services

Services represent the physical resources. Examples of services are:

-   1. Storage Devices attached to FC or Ethernet ports. These devices can be simple disks, complex RAID arrays, FC-AL connections, tape devices, etc.
-   2. Router connections to the Internet.
-   3. NAS—Internally defined ones only.

2.5.2 Policies

A preferred practice of the invention can implement the following types of policies:

-   1. Configuration Policy—A policy to configure another policy or a feature. For example, a NAS Server in a virtual domain will be configured as a “Service”. Another way to look at it is that a Configuration Policy is simply the collection of configurable parameters for an object.
-   2. Usage Policy—A policy to define how data is handled. In our case, load balancing is an example of a “Usage Policy”. When a user configures load balancing they are defining a policy that specifies how to distribute client requests based on a set of criteria.

There are many ways to describe a policy or policies. For our purposes we will define a policy as composed of the following:

-   1. Policy Rules—1 or more rules describing “what to do”. A rule is made up of condition(s) and actions. Conditions can be as simple as “match anything” or as complex as “if source IP address 1.1.1.1 and it's 2:05”. Likewise, actions can be as simple as “send to 2.2.2.2” or as complex as “load balance using LRU between 3 NAS servers”.
-   2. Policy Domain—A collection of object(s) the Policy Rules apply to. For example, suppose there was a policy that said “load balance using round robin”. The collection of NAS servers being load balanced is the policy domain for the policy.

Policies can be nested to form complex policies.
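The following data-structure sketch is one possible way to represent the policy model just described: a policy couples rules (conditions plus actions) with the policy domain the rules apply to, and policies may nest. The class and field names (PolicyRule, Policy, evaluate) are assumptions for this example, not terms from the specification.

```python
# Sketch: a policy = rules (condition + action) + the domain they apply to,
# with optional nested child policies.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PolicyRule:
    condition: Callable[[dict], bool]   # e.g. lambda pkt: pkt["src"] == "1.1.1.1"
    action: Callable[[dict], dict]      # e.g. forward, drop, load balance

@dataclass
class Policy:
    name: str
    domain: List[str]                   # objects the rules apply to (e.g. NAS servers)
    rules: List[PolicyRule] = field(default_factory=list)
    children: List["Policy"] = field(default_factory=list)   # nested policies

    def evaluate(self, packet):
        for rule in self.rules:
            if rule.condition(packet):
                packet = rule.action(packet)
        for child in self.children:
            packet = child.evaluate(packet)
        return packet
```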

2.6 Boot Sequence and Configuration

The MIC and other cards coordinate their actions during boot-up configuration processing via System Service's Notify Service. These actions need to be coordinated in order to prevent the passing of traffic before configuration file processing has completed.

The other cards need to initialize with default values, set the state of their ports to “hold down” and wait for a “Config Complete” event from the MIC. Once this event is received, the ports can be released and process traffic according to the current configuration (which may be default values if there were no configuration commands for the ports in the configuration file).

FIG. 14 illustrates this part of the boot-up sequence and interactions between the MIC, S2 Notify and the other cards.

There is an error condition in this sequence where the card never receives the “Config Complete” event. Assuming the software is working properly, this condition is caused by a hardware problem, and the ports on the cards will be held in the “hold down” state. If CSM/CSA is working properly, the MIC Mgmt Software will show the ports down, or CPCM might detect that the card is not responding and notify the MIC. In any case there are several ways to learn about and notify users about the failure.
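A small sketch of this hold-down behavior is shown below, purely as an illustration: a card applies defaults, holds its ports down, and waits for the “Config Complete” notification, with a timeout covering the error case just described. The class name CardBoot, the event string, the 60-second timeout, and the stubbed hardware calls are all assumptions.

```python
# Sketch: card-side boot coordination with the MIC's "Config Complete" event.
import threading

class CardBoot:
    def __init__(self):
        self.config_complete = threading.Event()

    def on_notify(self, event):
        if event == "CONFIG_COMPLETE":       # delivered via the notify service
            self.config_complete.set()

    def boot(self, timeout_sec=60.0):
        self.apply_default_config()
        self.hold_ports_down()
        if self.config_complete.wait(timeout=timeout_sec):
            self.release_ports()             # forward traffic per current config
        else:
            # Ports stay held down; management/CPCM will report the fault.
            self.report_failure("Config Complete never received")

    # The methods below stand in for real hardware/management calls.
    def apply_default_config(self): pass
    def hold_ports_down(self): pass
    def release_ports(self): pass
    def report_failure(self, reason): print("boot error:", reason)
```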

3. LIC Software

The LIC (LAN Interface Card) consists of LAN Ethernet ports of the 10/100/1000 Mbps variety. Behind the ports are 4 network engine processors. Each port on a LIC will behave like a layer 2 and layer 3 switch. The functionality of switching and intelligent forwarding is referred to herein as IFF—Intelligent Forwarding and Filtering. The main purpose of the network engine processors is to forward packets based on Layer 2, 3, 4 or 5 information. The ports will look and act like router ports to hosts on the LAN. Only RIP will be supported in the first release, with OSPF to follow.

3.1 VLANs

The box will support port-based VLANs. The division of the ports will be based on configuration, and initially all ports will belong to the same VLAN. Alternative practices of the invention can include VLAN classification and tagging, including possibly 802.1p and 802.1Q support.

3.1.1 Intelligent Filtering and Forwarding (IFF)

The IFF features are discussed in greater detail below. Layer 2 and layer 3 switching will take place inside the context of IFF. Forwarding table entries are populated by layer 2 and 3 address learning. If an entry is not known, the packet is sent to the IP routing layer and it is routed at that level.

3.2 Load Balance Data Flow

NFS load balancing will be supported within a SANStream chassis. Load balancing based upon virtual IP addresses, content and flows are all possible.

The SANStream box will monitor the health of internal NFS servers that are configured as load balancing servers, and will notify network management of detectable issues as well as notify a disk management layer so that recovery may take place. It will, in these cases, stop sending requests to the troubled server, but continue to load balance across the remaining NFS servers in the virtual domain.

3.3 LIC—NAS Software

3.3.1 Virtual Storage Domains (VSD)

FIG. 15 provides another VSD example. The switch system of the invention is designed to support, in one embodiment, multiple NFS and CIFS servers in a single device that are exported to the user as a single NFS server (only NFS is supported in the first release). These servers are masked under a single IP address, known as a Virtual Storage Domain (VSD). Each VSD will have one to many connections to the network via a Network Processor (NP) and may also have a pool of Servers (referred to as “Server” throughout this document) connected to the VSD via the fabric on the SRC card.

Within a virtual domain there are policy domains. These sub-layers define the actions needed to categorize the frame and send it to the next hop in the tree. These policies can define a large range of attributes in a frame and then impose an action (implicit or otherwise). Common policies may include actions based on protocol type (NFS, CIFS, etc.) or source and destination IP or MAC address. Actions may include implicit actions like forwarding the frame on to the next policy for further processing, or explicit actions such as drop.

FIG. 15 diagrams a hypothetical virtual storage domain owned by Fred's ASP 1502. In this example, Fred has the configured address of 1.1.1.1 that is returned by the domain name service when queried for the domain's IP address. The next level of configuration is the policy domain. When a packet arrives into the Pirus box from a router port, it is classified as a member of Fred's virtual domain because of its destination IP address. Once the virtual domain has been determined, its configuration is loaded in and a policy decision is made based on the configured policy. In the example above, let's assume an NFS packet arrived. The packet will be associated with the NFS policy domain and a NAT (network address translation—described below) takes place, with the destination address becoming that of the NFS policy domain. The packet now gets associated with the NFS policy domain. The process continues with the configuration of the NFS policy being loaded in and a decision being made based on the configured policy. In the example above, the next decision to be made is whether or not the packet contains the gold, silver, or bronze service. Once that determination is made (let's assume the client was identified as a gold customer), a NAT is performed again to make the destination the IP address of the Gold policy domain. The packet now gets associated with the Gold policy domain. The process continues with the configuration for the Gold policy being loaded in and a decision being made based on the configured policy. At this point a load balancing decision is made to pick the best server to handle the request. Once the server is picked, NAT is again performed and the destination IP address of the server is set in the packet. Once the destination IP address of the packet becomes that of a device configured for load balancing, a switching operation is made and the packet is sent out of the box.

The implementation of the algorithm above lends itself to recursion, and may or may not incur as many NAT steps as described. It is left to the implementer to shortcut the number of NATs while maintaining the overall integrity of the algorithm.
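To make the recursive shape of this classification concrete, the sketch below walks a packet down nested policy domains, NATing the destination at each step until a concrete server is selected. The PolicyDomain/classify names and the intermediate addresses (10.0.0.x) are illustrative assumptions; only the VIP 1.1.1.1 comes from the Fred's ASP example.

```python
# Sketch: recursive policy-domain classification with a NAT at each level.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PolicyDomain:
    address: str
    is_server: bool = False
    # chooses the next domain/server for a packet; None means this is the end
    select: Optional[Callable[[dict], "PolicyDomain"]] = None

def classify(packet, domain):
    """Recursively NAT the packet toward the chosen policy domain/server."""
    packet["dst"] = domain.address
    if domain.is_server or domain.select is None:
        return packet                         # final destination: switch it out
    return classify(packet, domain.select(packet))

# Example: virtual domain -> NFS policy -> Gold policy -> chosen server.
server = PolicyDomain("10.0.0.10", is_server=True)
gold = PolicyDomain("10.0.0.3", select=lambda pkt: server)
nfs = PolicyDomain("10.0.0.2", select=lambda pkt: gold)
freds_asp = PolicyDomain("1.1.1.1", select=lambda pkt: nfs)
print(classify({"src": "9.9.9.9", "dst": "1.1.1.1"}, freds_asp))
```

An implementer could collapse the intermediate rewrites into a single NAT, as the text above notes, without changing the recursive structure.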

FIG. 15 also presents the concept of port groups 1512, 1516. Port groups are entities that have identical functionality and are members of the same virtual domain. Port group members provide a service. By definition, any member of a particular port group, when presented with a request, must be able to satisfy that request. Port groups may have routers, administrative entities, servers, caches, or other Pirus boxes off of them.

Virtual Storage Domains can reside across slots but not boxes. More than one Virtual Storage Domain can share a Router Interface.

3.3.2 Network Address Translation (NAT)

NAT translates from one IP Address to another IP Address. The reasons for doing NAT include Load Balancing, securing the identity of each Server from the Internet, reducing the number of IP Addresses purchased, reducing the number of Router ports needed, and the like.

Each Virtual Domain will have an IP Address that is advertised through the network NP ports. The IP Address is the address of the Virtual Domain and NOT the NFS/CIFS Server IP Address. The IP Address is translated at the Pirus device in the Virtual Storage Domain to the Server's IP Address. Depending on the Server chosen, the IP Address is translated to the terminating Server IP Address.

For example, in FIG. 15, IP Address 100.100.100.100 would translate to 1.1.1.1, 1.1.1.2 or 1.1.1.3, depending on the terminating Server.

3.3.3 Local Load Balance (LLB)

Local load balancing defines an operation of balancing between devices (i.e., servers) that are connected directly or indirectly off the ports of a Pirus box, without another load balancer getting involved. A lower-complexity implementation would, for example, support only the balancing of storage access protocols that reside in the Pirus box.

3.3.3.1 Load Balancing Order of Operations

In the process of load balancing configuration it may be possible to define multiple load balancing algorithms for the same set of servers. The need then arises to apply an order of operations to the load balancing methods. They are as follows, in the order they are applied:

-   1) Server loading info, percentage of loading on the server's Ethernet port, percentage of loading on the server's FC port, SLA support, Ratio Weight rating
-   2) Round Trip Time, Response time, Packet Rate, Completion Rate
-   3) Round Robin, Least Connections, Random

Load balancing methods in the same group are treated with the same weight in determining a server's loading. As the load balancing algorithms are applied, servers that have identical load characteristics (within a certain configured percentage) are moved to the next level in order to get a better determination of which server is best prepared to receive the request. The last load balancing methods that will be applied across the servers that have identical load characteristics (again within a configured percentage) are round robin, least connections and random.
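A simplified sketch of this tiered ordering follows: servers that tie (within a configured percentage) at one tier of metrics are carried forward to the next tier, and a final random/round-robin style choice breaks any remaining tie. The function names tie_band and pick_server, the 5% tolerance, and the two-tier structure shown are assumptions chosen to keep the example short.

```python
# Sketch: apply load-balancing metric tiers in order, carrying ties forward.
import random

def tie_band(servers, metric, tolerance=0.05):
    """Keep servers whose metric is within `tolerance` of the best value."""
    best = min(metric(s) for s in servers)
    return [s for s in servers if metric(s) <= best * (1 + tolerance)]

def pick_server(servers, tier1_metric, tier2_metric):
    candidates = tie_band(servers, tier1_metric)        # e.g. server loading info
    if len(candidates) > 1:
        candidates = tie_band(candidates, tier2_metric) # e.g. response time
    return random.choice(candidates)                    # final tie-break

servers = ["nas1", "nas2", "nas3"]
load = {"nas1": 0.40, "nas2": 0.41, "nas3": 0.90}.__getitem__
rtt = {"nas1": 3.0, "nas2": 1.5, "nas3": 1.0}.__getitem__
print(pick_server(servers, load, rtt))   # nas1/nas2 tie on load; nas2 wins on RTT
```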

3.3.3.2 File System Server Load Balance (FSLB)

The system of the invention is intended to provide load balancing across at least two types of file system servers, NFS and CIFS. NFS is stateless and CIFS is stateful, so there are differences to each method. The goal of file system load balancing is not only to pick the best identical server to handle the request, but to make a single virtual storage domain transparently hidden behind multiple servers.

3.3.3.3 NFS Server Load Balancing (NLB)

NFS is mostly stateless and idempotent (every operation returns the same result if it is repeated). This is qualified because operations such as READ are idempotent but operations such as REMOVE are not. Since there is little NFS server state as well as little NFS client state transferred from one server to the other, it is easy for one server to assume the other server's functions. The protocol will allow for a client to switch NFS requests from one server to another transparently. This means that the load balancer can more easily maintain an NFS session if a server fails. For example, if in the middle of a request a server dies, the client will retry, the load balancer will pick another server, and the request gets fulfilled (with possibly a file handle NAT), after only a retry. If the server dies between requests, then there isn't even a retry; the load balancer just picks a new server and fulfills the request (with possibly a file handle NAT).

When using NFS it will be possible for managers to set up the load balancer to balance load across multiple NFS servers that have identical data, or managers can set up load balancing to segment the balancing across servers that have unique data. The latter requires virtual domain configuration based on the file requested (location in the file system tree) and file type. The former requires a virtual domain and minimal other configuration (i.e., a load balancing policy).

The function of Load Balance Data Flow is to distribute the processing of requests over multiple servers. Load Balance Data Flow is the same as the Traditional Data Flow, but the NP statistically determines the load of each server that is part of the specified NFS request and forwards the request based on that server load. The load-balancing algorithm could be as simple as round robin or a more sophisticated administrator-configured policy.

Server load balance decisions are made based upon IP destination address. For any server IP address, a routing NP may have a table of configured alternate server IP addresses that can process an HTTP transaction. Thus multiple redundant NFS servers are supported using this feature.

TCP-based server load balance decisions are made within the NP on a per-connection basis. Once a server is selected through the balancing algorithm, all transactions on a persistent TCP connection will be made to the same originally targeted server. An incoming IP message's source IP address and IP source port number are the only connection lookup keys used by an NP.

For example, suppose a URL request arrives for 192.32.1.1. The Router NP processor's lookup determines that server 192.32.1.1 is part of a Server Group (192.32.1.1, 192.32.1.2, etc.). The NP decides which Server Group member to forward the request to via a user-configured algorithm. Round-robin, estimated actual load, and current connection count are all candidates for selection algorithms. If TCP is the transport protocol, the TCP session is then terminated at the specified SRC processor.
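The sketch below illustrates this per-connection behavior: the balancing decision is made once (at the SYN) and cached in a connection table keyed by the client's source IP address and source port, so later segments on the same connection reach the same server. The TcpBalancer class is an assumption for illustration; the server addresses reuse the Server Group from the example above.

```python
# Sketch: per-connection TCP balancing keyed on (source IP, source port).
import itertools

class TcpBalancer:
    def __init__(self, server_group):
        self._rr = itertools.cycle(server_group)   # stands in for the configured algorithm
        self._connections = {}                     # (src_ip, src_port) -> chosen server

    def target(self, src_ip, src_port, syn=False):
        key = (src_ip, src_port)
        if syn or key not in self._connections:
            self._connections[key] = next(self._rr)   # decide once, at the SYN
        return self._connections[key]                 # persist for the connection

lb = TcpBalancer(["192.32.1.1", "192.32.1.2"])
print(lb.target("10.0.0.5", 40001, syn=True))   # decision made at the SYN
print(lb.target("10.0.0.5", 40001))             # same server for later segments
```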

UDP protocols do not have an opening SYN exchange that must be absorbed and spoofed by the load balancing IXP. Instead, each UDP packet can be viewed as a candidate for balancing. This is both good and bad. The lack of an opening SYN simplifies part of the balance operation, but the effort of balancing each packet could add considerable latency to UDP transactions.

In some cases it will be best to make an initial balance decision and keep a flow mapped for a user-configurable time period. Once the period has expired, an updated balance decision can be made in the background and a new balanced NFS server target selected.

In many cases it will be most efficient to re-balance a flow during a relatively idle period. Many disk transactions result in forward-looking actions on the server (people who read the first half of a file often want the second half soon afterwards), and rebalancing during active disk transactions could actually hurt performance.

An amendment to the “time period” based flow balancing described above would be to arm the timer for an inactivity period and re-arm it whenever NFS client requests are received. A longer inactivity timer period could be used to determine when a flow should be deleted entirely rather than re-balanced.
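As a rough sketch of this inactivity-timer amendment, the example below re-arms a flow's timer on every client request, treats a short idle period as permission to re-balance, and a longer one as grounds for deleting the flow. The FlowTable name and the two timeout values are assumptions; a real implementation would choose (or let the administrator configure) its own periods.

```python
# Sketch: per-flow inactivity timer driving re-balance vs. delete decisions.
import time

REBALANCE_IDLE = 5.0     # assumed: idle seconds before re-balancing is allowed
DELETE_IDLE = 60.0       # assumed: idle seconds before the flow is dropped

class FlowTable:
    def __init__(self):
        self._last_seen = {}                      # flow key -> time of last request

    def touch(self, flow):
        self._last_seen[flow] = time.monotonic()  # re-arm on every NFS client request

    def idle_state(self, flow):
        idle = time.monotonic() - self._last_seen[flow]
        if idle >= DELETE_IDLE:
            return "delete"
        if idle >= REBALANCE_IDLE:
            return "rebalance"
        return "active"
```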

3.3.3.4 TCP and UDP—Methods of Balancing

NFS can run over both TCP and UDP (UDP being more prevalent). When processing UDP NFS requests, the method used for pseudo-proxy of TCP sessions does not need to be employed. During a UDP session, the information needed to make a rational load balancing decision is available with the first packet.

Several methods of load balancing are possible. The first and simplest to implement is load balancing based on source address—all requests are sent to the same server for a set period of time after a load balancing decision is made to pick the best server at the UDP request or the TCP SYN.

Another method is to load balance every request with no regard for the previous server the client was directed to. This will possibly require obtaining a new file handle from the new server and NATing so as to hide the file handle change from the client. This method also carries with it more processing overhead (every request is load balanced) and more implementation effort, but does give a more balanced approach.

Yet another method for balancing NFS requests is to cache a “next balance” target based on previous experience. This avoids the overhead of extensive balance decision making in real time, and has the benefit of more even client load distribution.

In order to reduce the processing of file handle differences between identical internal NFS servers, all disk modify operations will be strictly ordered. This will ensure that the inode numbering is consistent across all identical disks.

Among the load balancing methods that can be used (others are possible) are:

-   Round Robin
-   Least Connections
-   Random (lower IP-bits, hashing)
-   Packet Rate (minimum throughput)
-   Ratio Weight rating
-   Server loading info and health as well as application health
-   Round Trip Time (TCP echo)
-   Response time

3.3.3.5 Write Replication

NFS client read and status transactions can be freely balanced across a VLAN family of peer NFS servers. Any requests that result in disk content modification (file create, delete, set-attributes, data write, etc.) must be replicated to all NFS servers in a VLAN server peer group.

The Pirus Networks switch fabric interface (SFI) will be used to multicast NFS modifications to all NFS servers in a VLAN balancing peer group. All NFS client requests generate server replies and have a unique transaction ID. This innate characteristic of NFS can be used to verify and confirm the success of multicast requests.

At least two mechanisms can be used for replicated transaction confirmation: “first answer” and quorum. Using the “first answer” algorithm, an IXP would keep minimal state for an outstanding NFS request, and return the first response it receives back to the client. The quorum system would require the IXP to wait for some percentage of the NFS peer servers to respond with identical messages before returning one to the client.

Using either method, unresponsive NFS servers are removed from the VLAN peer balancing group. When a server is removed from the group, the Pirus NFS mirroring service must be notified so that recovery procedures can be initiated.

A method for coordinating NFS write replication is set forth in FIG. 16,including the following steps: check for NFS replication packet 1602; ifyes, multicast packet to entire VLAN NFS server peer group 1604; waitfor 1^(st) NFS server reply with timeout 1608; send 1^(st) server replyto client 1610; remove unresponsive servers from LB group and inform NFSmirroring service 1610. If not an NFS replication packet, load balanceand unicast to NFS server 1606.

3.3.4 Load Balancer Failure Indication

When a load balancer declares that a peer NFS server is being dropped from the group the NFS mirroring service is notified. A determination must be made as to whether the disk failure was soft or hard.

In the case of a soft failure a hot synchronization should be attempted to bring the failing NFS server back online. All NFS modify transactions must be recorded for playback to the failing NFS server when it returns to service.

When a hard failure has occurred an administrator must be notified and a fresh disk will be brought online, formatted, and synchronized.

3.3.4.1 CIFS Server Load Balancing

CIFS is stateful and as such there are fewer options available for load balancing. CIFS is a session-oriented protocol; a client is required to log on to a server using simple password authentication or a more secure cryptographic challenge. CIFS supports no recovery guarantees if the session is terminated through server or network outage. Therefore load balancing of CIFS requests must be done once at TCP SYN, and persistence must be maintained throughout the session. If a disk fails and not the CIFS server, then a recovery mechanism can be employed to transfer state from one server to another and maintain the session. However, if the server fails (hardware or software) and there is no way to transfer state from the failed server to the new server, then the TCP session must be brought down and the client must reestablish a new connection with a new server. This means relogging and recreating state in the new server.

Since CIFS is TCP based, the balancing decision will be made at the TCP SYN. Since the TCP session will be terminated at the destination server, that server must be able to handle all requests that the client believes exist under that domain. Therefore all CIFS servers that are masked by a single virtual domain must have identical content on them. Secondly, data that spans an NFS server file system must be represented as a separate virtual domain and accessed by the client as another CIFS server (i.e. another mount point).

Load balancing will support source address based persistence and send all requests to the same server, based on an inactivity timeout. Load balancing methods used will be:

-   Round Robin
-   Least Connections
-   Random (lower IP bits, hashing)
-   Packet Rate (minimum throughput)
-   Ratio Weight rating
-   Server loading info and health as well as application health
-   Round Trip Time (TCP echo)
-   Response time

3.3.4.2 Content Load Balance

Content load balancing is achieved by delving deeper into packet contents than simple destination IP address.

Through configuration and policy it will be possible to re-target NFS transactions to specific servers based upon NFS header information. For example, a configuration policy may state that all files under a certain directory are load balanced between two specified NFS servers.

A hierarchy of load balancing rules may be established when Server Load Balancing is configured subordinate to Content Load Balancing.

3.4 LIC—SCSI/IP Software

3.5 Network Processor Functionality

FIG. 17 is a top-level block diagram of the software on an NP. Note that the implementation of a block may be split across the policy processor and the micro-engines. Note also that not all blocks may be present on all NPs. The white blocks are common (in concept and to some level of implementation) between all NPs; the lightly shaded blocks are present on NPs that have load balancing and storage server health checking enabled on them.

3.5.1 Flow Control

3.5.1.1 Flow Definition

Flows are defined as source port, destination port, and source and destination IP address. Packets are tagged coming into the box and classified by protocol, destination port and destination IP address. Then, based on policy and/or the TOS bit, a priority is assigned within the class. Classes are associated with a priority when compared to other classes. Within the same class, priorities are assigned to packets based on the TOS bit setting and/or policy.
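As a rough sketch of this classification, a packet could be reduced to a flow tuple, a class, and an intra-class priority as below. The dict-based packet representation, the example class table, and the use of the IP precedence bits of the TOS field as the priority are all assumptions made for illustration.

```python
from collections import namedtuple

Flow = namedtuple("Flow", "src_ip src_port dst_ip dst_port")

# Assumed class table: (protocol, destination port) -> traffic class.
CLASS_TABLE = {("tcp", 2049): "nfs", ("udp", 2049): "nfs", ("tcp", 80): "http"}

def classify(packet):
    """Tag a packet with its flow, class, and priority within the class."""
    flow = Flow(packet["src_ip"], packet["src_port"],
                packet["dst_ip"], packet["dst_port"])
    traffic_class = CLASS_TABLE.get((packet["proto"], packet["dst_port"]), "default")
    priority = packet.get("tos", 0) >> 5        # IP precedence bits stand in for policy
    return flow, traffic_class, priority
```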

3.5.1.2 Flow Control Model

Flow control will be provided within the SANStream product to the extent described in this section. Each of the egress Network Processors will perform flow control. There will be a queue High Watermark that, when approached, will cause flow control indications from the egress Network Processor to offending Network Processors based on QoS policy. The offending Network Processor will narrow TCP windows (when present) to reduce traffic flow volumes. If the egress Network Processor exceeds a Hard Limit (something higher than the High Watermark), the egress Network Processor will perform intelligent dropping of packets based on class priority and policy. As the situation improves and the Low Watermark is approached, egress control messages back to the offending network processors allow for resumption of normal TCP window sizes.

For example, in FIG. 18, the egress Network Processor is NP1 1802 and the offending Network Processors are NP2 1804 and NP4 1808. NP2 and NP4 were determined to be offending NPs based on the High Watermark and each of their policies. NP1, detecting the offending NPs, sends flow control messages to each of those processors. These offending processors should perform flow control as described previously. If the Hard Limit is reached in NP1, then packets received by NP2 or NP4 can be dropped intelligently (in a manner that can be determined by the implementer).
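The watermark model can be summarized by the following sketch. The numeric thresholds, the control-message strings, and the helper callables are assumptions for illustration; the text does not fix these values or names.

```python
HIGH_WATERMARK = 8000     # queue depths are illustrative, not product values
HARD_LIMIT = 10000
LOW_WATERMARK = 4000

def egress_queue_event(queue_depth, offenders, send_ctrl, drop_policy, packet=None):
    """Egress-side reaction to queue depth, per the flow control model above.

    offenders: NPs currently sending too aggressively; send_ctrl(np, msg)
    delivers a flow-control message; drop_policy(pkt) decides whether a
    packet may be dropped given its class priority and policy.
    """
    if queue_depth >= HARD_LIMIT and packet is not None and drop_policy(packet):
        return "dropped"                        # intelligent drop above the Hard Limit
    if queue_depth >= HIGH_WATERMARK:
        for np in offenders:
            send_ctrl(np, "narrow-tcp-window")  # ask offenders to shrink TCP windows
    elif queue_depth <= LOW_WATERMARK:
        for np in offenders:
            send_ctrl(np, "restore-tcp-window") # congestion cleared; resume normal windows
    return "ok"
```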

3.5.2 Flow Thru Vs. Buffering

There will be a distinct differentiation in performance between the flow-thru path and the other, slower paths of processing.

3.5.2.1 Flow Thru

Fast path processing will be defined as flow-thru. This path will not include buffering. Packets in this path must be designated as flow-thru within the first N bytes (current thinking is M ports for the IXP-1200). These types of packets will be forwarded directly to the destination processor to then be forwarded out of the box. Packets that are eligible for flow-thru include flows that have an IFF table entry, Layer 2 switchable packets, packets from the servers to clients, and FC switchable frames.

3.5.2.2 Buffering

Packets that require further processing will need to be buffered and will take one of 2 paths.

Buffered Fast Path

The first buffered path is taken on packets that require further looking into the frame. These frames will need to be buffered in order that more of the packet can be loaded into a micro-engine for processing. These include deep processing of layer 4-7 headers, load balancing and QoS processing.

Slow Path

The second buffered path occurs when, during processing in a micro-engine, a determination is made that more processing needs to occur that can't be done in a micro-engine. These packets require buffering and will be passed to the NP co-processor in that form. When this condition has been detected the goal will be to process as much as possible in the micro-engine before handing it up to the co-processor. This will take advantage of the performance that is inherent in a micro-engine design.

4. SRC NAS

The Pirus Networks 1st generation Storage Resource Card (SRC) is implemented with 4 occurrences of a high performance embedded computing kernel. A single instance of this kernel can contain the components shown in FIG. 19.

Software Features

The SRC Phase 1 NAS software load will provide NFS server capability. Key requirements include:

-   High performance—no software copies on read data, caching
-   High availability—balancing, mirroring

4.1 SRC NAS Storage Features

4.1.1 Volume Manager

A preferred practice of the Pirus Volume Manager provides support for crash recovery and resynchronization after failure. This module will interact with the NFS mirroring service during resynchronization periods. Disk Mirroring (RAID-1), hot sparing, and striping (RAID-0) are also supported.

4.1.2 Disk Cache

Tightly coupled with the Volume Manager 2002, a Disk Cache module 2004 will utilize the large pool of buffer RAM to eliminate redundant disk accesses. Object based caching (rather than page-based) can be utilized. Disk Cache replacement algorithms can be dynamically tuned based upon perceived role. Database operations (frequent writes) will benefit from a different cache model than html serving (frequent reads).

4.1.3 SCSI

Initiator mode support is required in phase 1. This layer will be tightly coupled with the Fibre Channel controller device. Implementers will wish to verify the interoperability of this protocol with several current generation drives (IBM, Seagate), JBODs, and disk arrays.

4.1.4 Fibre Channel

The disclosed system will provide support for fabric node (N_PORT) and arbitrated loop (NL_PORT). The Fibre Channel interface device will provide support for SCSI initiator operations, with interoperability of this interface with current generation FC Fabric switches (such as those from Brocade, Ancor). Point-to-Point mode can also be supported; and it is understood that the device will perform master mode DMA to minimize processor intervention. It is also to be understood that the invention will interface and provide support to systems using NFS, RPC (Remote Procedure Call), MNT, PCNFSD, NLM, MAP and other protocols.

4.1.5 Switch Fabric Interface

A suitable switch fabric interface device driver is left to the implementer. Chained DMA can be used to minimize CPU overhead.

4.2 NAS Pirus System Features

4.2.1 Configuration/Statistics

The expected complement of parameters and information will be available through management interaction with the Pirus chassis MIC controller.

4.2.2 NFS Load Balancing

The load balancing services of the LIC are also used to balance requests across multiple identical NFS servers within the Pirus chassis. NFS data read balancing is a straightforward extension to planned services when Pirus NFS servers are hidden behind a NAT barrier.

With regard to NFS data write balancing, when a LIC receives NFS create, write, or remove commands they must be multicast to all participating NFS SRC servers that are members of the load balancing group.

4.2.3 NFS Mirroring Service

The NFS mirroring service is responsible for maintaining the integrity of replicated NFS servers within the Pirus chassis. It coordinates the initial mirrored status of peer NFS servers upon user configuration. This service also takes action when a load-balancer notifies it that a peer NFS server has fallen out of the group or when a new disk “checks in” to the chassis.

This service interacts with individual SRC Volume Manager modules to synchronize file system contents. It could run on a #9 processor associated with any SRC module or on the MIC.

5. SRC Mediation

Storage Mediation is the technology of bridging between storage media of different types. We will mediate between Fibre Channel targets and initiators and IP based targets and initiators. The disclosed embodiment will support numerous mediation techniques.

5.1 Supported Mediation Protocols

Mediation protocols that can be supported by the disclosed architecture will include Cisco's SCSI/TCP, Adaptec's SEP protocol, and the standard canonical SCSI/UDP encapsulation.

5.1.1 SCSI/UDP

SCSI/UDP has not been documented as a supported encapsulation technique by any hardware manufacturer. However, UDP has some advantages in speed when compared to TCP. UDP, however, is not a reliable transport. Therefore it is proposed that we use SCSI/UDP to extend the Fibre Channel fabric through our own internal fabric (see FIG. 21, demonstrating SCSI/UDP operation with elements 100, IBM 2102 and Disk Array 2104). The benefit of UDP is lower processing and latency. Reliable UDP (a Cisco protocol) may also be used in the future if we want to extend the protocol to the LAN or the WAN.
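For illustration only, a SCSI CDB carried over UDP might look like the sketch below. The 8-byte header layout (tag, LUN, CDB length), the helper name, and the peer address are hypothetical; the text states only that a canonical SCSI/UDP encapsulation is supported, not its wire format.

```python
import socket
import struct

def send_scsi_cdb_over_udp(sock, peer, cdb, lun, tag):
    """Hypothetical SCSI/UDP encapsulation: 8-byte header followed by the CDB."""
    header = struct.pack("!IHH", tag, lun, len(cdb))   # tag (4), LUN (2), CDB length (2)
    sock.sendto(header + cdb, peer)

# Usage sketch (addresses and port are placeholders):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_scsi_cdb_over_udp(sock, ("10.0.0.2", 5001), bytes(16), lun=0, tag=1)
```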

5.2 Storage Components

The following discussion refers to FIG. 22, which depicts software components for storage (2202 et seq.).

5.2.1 SCSI/IP Layer

The SCSI/IP layer is a full TCP/IP stack and application software dedicated to the mediation protocols. This is the layer that will initiate and terminate SCSI/IP requests for initiators and targets respectively.

5.2.2 SCSI Mediator

The SCSI mediator acts as a SCSI server to incoming IP payload.

This thin module maps between IP addresses and SCSI devices and LUNs.

5.2.3 Volume Manager

The Pirus Volume Manager will provide support for disk formatting, mirroring (RAID-1) and hot spare synchronization. Striping (RAID-0) may also be available in the first release. The VM must be bulletproof in the HA environment. NVRAM can be utilized to increase performance by committing writes before they are actually delivered to disk.

When the Volume Manager is enabled a logical volume view is presented to the SCSI mediator as a set of targetable LUNs. These logical volumes do not necessarily correspond to physical SCSI devices and LUNs.

5.2.4 SCSI Originator

In the disclosed architecture this layer will be tightly coupled with the Fibre Channel controller device, with interoperability of this protocol with several current generation drives (IBM, Seagate), JBODs, and disk arrays. This module can be identical to its counterpart in the SRC NAS image.

5.2.5 SCSI Target

SCSI target mode support will be required if external FC hosts are permitted to indirectly access remote SCSI disks via mediation (e.g. SCSI/FC→SCSI/FC via SCSI/TCP).

5.2.6 Fibre Channel

In the disclosed embodiments, support will be provided for fabric node (N_PORT) and arbitrated loop (NL_PORT). The Fibre Channel interface device will provide support for SCSI initiator or target operations. Interoperability of this interface with current generation FC Fabric switches (Brocade, Ancor) must be assured. Point-to-Point mode must also be supported. This module should be identical to its counterpart in the SRC NAS image.

5.3 Mediation Example

FIG. 23 depicts an FC originator communicating with an FC Target (elements 2302 et seq.), as follows:

ORIGINATOR˜ sends a SCSI Read Command to TARGET^

-   1. Each Originator/Target pair complete their LIP Sequence. Each 750 is notified of the existence of the Originator˜/Target^.
-   2. 750˜ generates an IP command that tells IXP˜ to make a connection to IXP^.
-   3. 750^ generates an IP command to tell IXP^ to make Target^ ‘visible’ over IP.
-   4. Originator˜ issues a SCSI READ CDB to Target˜. Target˜ sends the CDB to 750˜.
-   5. 750˜ builds a SCSI/IP request with the CDB and issues it to IXP˜.
-   6. IXP˜ sends the packet to IXP^.
-   7. IXP^ sends the IP packet to 750^.
-   8. 750^ removes the SCSI CDB from the IP packet and issues a SCSI CDB request to Originator^ (memory for the READ COMMAND has been allocated).
-   9. Originator^ issues FCP_CMND to Target^.
-   10. When the command is complete Target^ sends FCP_RSP to Originator^. Originator^ notifies 750^ with good status.
-   11. 750^ packages data and status into IP packets and sends them to IXP^.
-   12. IXP^ sends data and status to IXP˜.
-   13. IXP˜ sends IP packets with data and status to 750˜.
-   14. 750˜ allocates buffer space, dumps data into buffers and requests Target^ to send data and response to Originator˜.

III. NFS Load Balancing

An object of load balancing is that several individual servers are made to appear as a single, virtual server to clients. An overview is provided in FIG. 24, including elements 2402 et seq. In particular, the client makes file system requests to a virtual server. These requests are then directed to one of the servers that make up the virtual server. The file system requests can be broken into two categories:

-   1) reads, or those requests that do not modify the file system; and
-   2) writes, or those requests that do change the file system.

Read requests do not change the file system and thus can be sent to any of the individual servers that make up the virtual server. Which server a request is sent to is determined by one of several possible load balancing algorithms. This spreads the requests across several servers, resulting in an improvement in performance over a single server. In addition, it allows the performance of a virtual server to be scaled simply by adding more physical servers.

Some of the possible load balancing algorithms are:

-   1. Round Robin, where each request is sent sequentially to the next server.
-   2. Weighted access, where requests are sent to servers based on a percentage formula, e.g. 15% of the requests go to server A, 35% to server B, and 50% to server C. These weighting factors can be fixed, or be dynamic based on such factors as server response time.
-   3. File handle, where requests for files that have been accessed previously are directed back to the server that originally satisfied the request. This increases performance by increasing the likelihood that the file will be found in the server's cache.
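Two of the algorithms just listed, weighted access and file-handle affinity, might be sketched as follows. The example percentages come from the text; the helper names and the handle_map structure are assumptions for illustration.

```python
import random

def weighted_pick(weights):
    """Weighted access: 'weights' maps server -> percentage share,
    e.g. {"A": 15, "B": 35, "C": 50}."""
    servers = list(weights)
    return random.choices(servers, weights=[weights[s] for s in servers], k=1)[0]

def file_handle_pick(file_handle, handle_map, fallback):
    """File-handle affinity: repeat requests for a handle go back to the
    server that first satisfied it, so the file is likely still in its cache."""
    if file_handle not in handle_map:
        handle_map[file_handle] = fallback()   # e.g. fallback=lambda: weighted_pick(...)
    return handle_map[file_handle]
```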

Write requests are different from read requests in that they must be broadcast to each of the individual servers so that the file systems on each server stay in sync. Thus, each write request generates several responses, one from each of the individual servers. However, only one response is sent back to the client.

An important way to improve performance is to return to the client the first positive response from any of the servers instead of waiting for all the server responses to be received. This means the client sees the fastest server response instead of the slowest. A problem can arise if all the servers do not send the same response, for example if one of the servers fails to do the write while all the others are successful. This results in the servers' file systems becoming un-synchronized. In order to catch and fix un-synchronized file systems, each outstanding write request must be remembered and the responses from each of the servers kept track of.
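The bookkeeping described here, remembering each outstanding write and the per-server responses, might look like the sketch below. The class name, the use of the NFS transaction ID as the key, and the return values are assumptions for illustration.

```python
class OutstandingWrites:
    """Track replicated write requests so mismatched replies can be caught."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.pending = {}                      # xid -> {server: ok_flag}

    def sent(self, xid):
        self.pending[xid] = {}

    def reply(self, xid, server, ok):
        record = self.pending.get(xid)
        if record is None:
            return None
        first_reply = not record               # first response goes back to the client
        record[server] = ok
        if set(record) == self.peers:          # all peers answered; check consistency
            statuses = set(record.values())
            del self.pending[xid]
            if len(statuses) > 1:
                return "unsynchronized"        # some servers failed the write; start recovery
        return "answer-client" if first_reply else None
```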

The file handle load balancing algorithm works well for directing requests for a particular file to a particular server. This increases the likelihood that the file will be found in the server's cache, resulting in a corresponding increase in performance over the case where the server has to go out to a disk. It also has the benefit of preventing a single file from being cached on two different servers, which uses the servers' caches more efficiently and allows more files to be cached. The algorithm can be extended to cover the case where a file is being read by many clients and the rate at which it is served to these clients could be improved by having more than one server serve this file. Initially a file's access will be directed to a single server. If the rate at which the file is being accessed exceeds a certain threshold, another server can be added to the list of servers that handle this file. Successive requests for this file can be handled in a round robin fashion between the servers set up to handle the file. Presumably the file will end up in the caches of both servers. This algorithm can handle an arbitrary number of servers handling a single file.

The following discussion describes methods and apparatus for providing NFS server load balancing in a system utilizing the Pirus box, and focuses on the process of how to balance file reads across several servers.

As illustrated in FIG. 24, NFS load balancing is done so that multiple NFS servers can be viewed as a single server. An NFS client issuing an NFS request does so to a single NFS IP address. These requests are captured by the NFS load balancing functionality and directed toward specific NFS servers. The determination of which server to send the request to is based on two criteria: the load on the server and whether the server already has the file in cache.

The terms “SA” (the general purpose StrongArm processor that resides inside an IXP) and “Micro-engine” (the micro-coded processor in the IXP; in one embodiment of the invention, there are 6 in each IXP) are used herein.

As shown in the accompanying diagrams and specification, the invention utilizes “workload distribution” methods in conjunction with a multiplicity of NFS (or other protocol) servers. Among these methods (generically referred to herein as “load balancing”) are methods of “server load balancing” and “content aware switching”.

A preferred practice of the invention combines both “Load Balancing” and “Content Aware Switching” methods to distribute workload within a file server system. A primary goal of this invention is to provide scalable performance by adding processing units, while “hiding” this increased system complexity from outside users.

The two methods used to distribute workload have different but complementary characteristics. Both rely on the common method of examining or interpreting the contents of incoming requests, and then making a workload distribution decision based on the results of that examination.

Content Aware Switching presumes that the multiplicity of servers handle different contents; for example, different subdirectory trees of a common file system. In this mode of operation, the workload distribution method would be to pass requests for (e.g.) “subdirectory A” to one server, and “subdirectory B” to another. This method provides a fair distribution of workload among servers, given a statistically large population of independent requests, but cannot provide enhanced response to a large number of simultaneous requests for a small set of files residing on a single server.

Server Load Balancing presumes that the multiplicity of servers handle similar content; for example, different RAID 1 replications of the same file system. In this mode of operation, the workload distribution method would be to select one of the set of available servers, based on criteria such as the load on the server, its availability, and whether it has the requested file in cache. This method provides a fair distribution of workload among servers when there are many simultaneous requests for a relatively small set of files.

These two methods may be combined, with content aware switching selecting among sets of servers, within which load balancing is performed to direct traffic to individual servers. As a separate invention, the content of the servers may be dynamically changed, for example by creating additional copies of commonly requested files, to provide additional server capacity transparently to the user.

As shown in the accompanying diagrams and specification, one element of the invention is the use of multiple computational elements, e.g. Network Processors and/or Storage CPUs, interconnected with a high speed connection network, such as a packet switch, crossbar switch, or shared memory system. The resultant tight, low latency coupling facilitates the passing of necessary state information between the traffic distribution method and the file server method.

1. Operation

1.1 Read Requests

Referring now to FIGS. 25 and 26, the following is the sequence of events that occurs in one embodiment of the invention when an NFS READ (which could also include other requests like LOOKUP) request is received.

-   1. A Micro-engine receives a packet on one of its ports from an NFS client that contains a READ request to the NFS domain.
-   2. The Micro-engine uses the file handle contained in the request to perform a lookup in a file handle hash table.
-   3. The hash lookup results in a pointer to a file handle entry (we'll assume a hit for now).
-   4. In the hash table is the IP address for the specific NFS server the request should be directed to. Presumably this NFS server should have the file in its cache and thus be able to serve it up more quickly than one that does not.
-   5. The destination IP address of the packet with the READ request is updated with the server IP address and then forwarded to the server.

A hash table entry can have more than one NFS server IP address. This allows a file that is under heavy access to exist in more than one NFS server cache and thus to be served up by more than one server. The selection of which specific server to direct a specific READ request to can be determined by the implementer, but could be as simple as a round robin.
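A minimal sketch of such a file handle hash table, with round-robin rotation among the servers caching a given file, is shown below. The entry fields and class name are assumptions; the text only requires that an entry can carry one or more server IP addresses and a hit count.

```python
import itertools

class FileHandleTable:
    """Sketch of the file-handle hash table used to steer NFS READs."""

    def __init__(self):
        self.table = {}                          # file handle -> entry

    def add(self, handle, server_ips):
        servers = list(server_ips)
        self.table[handle] = {"servers": servers,
                              "rr": itertools.cycle(servers),
                              "hits": 0}

    def lookup(self, handle):
        entry = self.table.get(handle)
        if entry is None:
            return None                          # miss: caller picks a server and adds an entry
        entry["hits"] += 1
        return next(entry["rr"])                # rewrite the packet's destination IP to this server
```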

1.2 Determining the Number of Servers for a File

The desired behavior is that:

-   1. Files that are lightly accessed, i.e. have a low number of accesses per second, only need to be served by a single server.
-   2. Files that are heavily accessed are served by more than one server.
-   3. Accesses to a file are directed to the same server, or set of servers if it is being heavily accessed, to keep accesses directed to those servers that have that file in its cache.

1.3 Server Lists

In addition to being able to be looked up using the file handle hash table, file handle entries can be placed on doubly linked lists. There can be a number of such linked lists. Each list has on it the file handle entries that have a specific number of servers serving them. There is a list for file handle entries that have only one server serving them. Thus, as shown in FIG. 27, for example, there might be a total of three lists: a single server list, a two-server list and a four-server list. The single server list has entries in it that are being served by one server, the two-server list is a list of the entries being served by two servers, etc.

File handle entries are moved from list to list as the frequency of access increases or decreases.

1.3.1 Single Server List

All the file handle entries begin on the single server list. When a READ request is received the file handle in the READ is used to access the hash table. If there is no entry for that file handle a free entry is taken from the entry free list and a single server is selected to serve the file, by some criteria such as least loaded, fastest responding or round robin. If no entries are free then a server is selected and the request is sent directly to it without an entry being filled out. Once a new entry is filled out it is added to the hash table and placed at the top of the single server list queue.

Periodically, a process checks the free list and if it is close to empty it will take some number of entries off the bottom of the single server list, remove them from the hash table and then place them back on the free list. This keeps the free list replenished.

Since entries are placed on the top of the list and taken off from the bottom, each entry spends a certain amount of time on the list, which varies according to the rate at which new file handle READ requests occur. During the period of time that an entry exists on the list it has the opportunity to be hit by another READ access. Each time a hit occurs a counter is bumped in the entry. If an entry receives enough hits while it is on the list to exceed a pre-defined threshold, it is deemed to have enough activity to deserve more servers serving it. Such an entry is then taken off the single server list, additional servers are selected to serve the file, and the entry is placed on one of the multiple server lists.
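The single-server-list handling just described could be sketched as below. The promotion threshold, the dict-based entries, and the helper callables are assumptions for illustration; the real division of work between micro-engines and the SA is described in the following paragraph.

```python
from collections import deque

PROMOTE_THRESHOLD = 8          # hits while on the list; the actual value is not specified

def read_request(handle, hash_table, free_list, single_list: deque, pick_server, promote):
    """Handle one READ against the single server list.

    hash_table: handle -> entry; single_list: new entries appended on the left,
    reclaimed from the right; free_list: pool of unused entry dicts.
    """
    entry = hash_table.get(handle)
    if entry is None:
        if not free_list:
            return pick_server()               # no free entry; just balance this request
        entry = free_list.pop()
        entry.update(handle=handle, servers=[pick_server()], hits=0)
        hash_table[handle] = entry
        single_list.appendleft(entry)          # top of the single server list
        return entry["servers"][0]
    entry["hits"] += 1
    if entry["hits"] > PROMOTE_THRESHOLD:
        single_list.remove(entry)
        promote(entry)                         # add servers, move to a multiple server list
    return entry["servers"][0]
```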

In the illustrated embodiment of the invention, it is expected that the micro-engines will handle the lookup and forwarding of requests to the servers, and that the SA will handle all the entry movements between lists and adding and removing them from the hash table. However, other distributions of labor can be utilized.

1.3.2 Multiple Server Lists

In addition to the single server list, there are multiple server lists. Each multiple server list contains the entries that are being served by the same number of servers. Just like entries on the single server list, entries on the multiple server lists get promoted to the top of the next list when their frequency of access exceeds a certain threshold. Thus a file that is being heavily accessed might move from the single server list, to the dual server list and finally to the quad server list.

When an entry moves to a new list it is added to the top of that list. Periodically, a process will re-sort the list by frequency of access. As a file becomes less frequently accessed it will move toward the bottom of its list. Eventually the frequency of access will fall below a certain threshold and the entry will be placed on the top of the previous list, e.g. an entry might fall off the quad server list and be put on the dual server list. During this demotion process the number of servers serving this file will be reduced.

1.4 Synchronizing Lists Across Multiple IXP's

The above scheme works well when one entity, i.e., an IXP, sees all the file READ requests. However, this will not be the case in most systems. In order to have the same set of servers serving a file, information must be passed between IXPs that have the same file entry. This information needs to be passed when an entry is promoted or demoted between lists, as this is when servers are added or taken away.

When an entry is going to be promoted by an IXP it first broadcasts to all the other IXPs asking for their file handle entries for the file handle of the entry it wants to promote. When it receives the entries from the other IXPs it looks to see whether one of the other IXPs has already promoted this entry. If it has, it adds the new servers from that entry. If not, it selects new servers based on some TBD criteria.

Demotion of an entry from one list to the other works much the same way, except that when the demoting IXP looks at the entries from the other IXPs it looks for entries that have fewer servers than its entry currently does. If there are any then it selects those servers. This keeps the same set of servers serving a file even as fewer of them are serving it. If there are no entries with fewer servers, then the IXP can use one or more criteria to remove the needed number of servers from the entry.

There are advantages to making load balancing decisions based upon filehandle information. When the inode portion of the filehandle is used to select a unique target NAS server for information reads, a maximally distributed cache is achieved. When an entire NAS working set of files fits in any one cache, then a lowest latency response system is created by allowing all working set files to be simultaneously inside every NAS server's cache. Load balancing is then best performed using a round-robin policy.

Pirus NAS servers will provide cache utilization feedback to an IXP load balancer. The LB can use this feedback to dynamically shift between maximally distributed caching and round-robin balancing for smaller working sets. These processes are depicted in FIGS. 25 and 26 (NFS Receive Micro-Code Flowchart and NFS Transmit Micro-Code Flowchart).
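The shift between the two regimes might be expressed as in the sketch below. The assumption that the inode occupies the first four bytes of the file handle is made purely for illustration; the actual handle layout is not specified here.

```python
def pick_read_server(file_handle, servers, working_set_fits_in_one_cache, rr_state):
    """Choose between maximally distributed caching and round robin.

    If cache utilization feedback indicates the whole working set fits in any
    single cache, round robin gives the lowest latency; otherwise hash the
    inode portion of the file handle so each file lives in exactly one cache.
    """
    if working_set_fits_in_one_cache:
        rr_state["next"] = (rr_state.get("next", 0) + 1) % len(servers)
        return servers[rr_state["next"]]
    inode = int.from_bytes(file_handle[:4], "big")     # assumed inode field position
    return servers[inode % len(servers)]
```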

IV. Intelligent Forwarding and Filtering

The following discussion describes certain Pirus box functions referred to as intelligent forwarding and filtering (IFF). IFF is optimized to support the load balancing function described elsewhere herein. Hence, the following discussion contains various load balancing definitions that will facilitate an understanding of IFF.

As noted elsewhere herein, the Pirus box provides load-balancing functions in a manner that is transparent to the client and server. Therefore, the packets that traverse the box do not incur a hop count as they would, for example, when traversing a router. FIG. 28 is illustrative. In FIG. 28, Servers 1, 2, and 3 are directly connected to the Pirus box (denoted by the pear icon), and packets forwarded to them are sent to their respective MAC addresses. Server 4 sits behind a router and packets forwarded to it are sent to the MAC address of the router interface that connects to the Pirus box. Two upstream routers forward packets from the Internet to the Pirus box.

1. Definitions

The following definitions are used in this discussion:

A Server Network Processor (SNP) provides the functionality for ports connected to servers. Packets received from a server are processed by an SNP.

A Router Network Processor (RNP) provides the functionality for ports connected to routers or similar devices. Packets received from a router are processed by an RNP.

In accordance with the invention, an NP may support the role of RNP and SNP simultaneously. This is likely to be true, for example, on 10/100 Ethernet modules, as the NP will serve many ports, connected to both routers and servers.

An upstream router is the router that connects the Internet to the Pirus box.

2. Virtual Domains

As used herein, the term “virtual domain” denotes a portion of a domain that is served by the Pirus box. It is “virtual” because the entire domain may be distributed throughout the Internet and a global load-balancing scheme can be used to “tie it all together” into a single domain.

In one practice of the invention, defining a virtual domain on a Pirus box requires specifying one or more URLs, such as www.fred.com, and one or more virtual IP addresses that are used by clients to address the domain. In addition, a list of the IP addresses of the physical servers that provide the content for the domain must be specified; the Pirus box will load-balance across these servers. Each physical server definition will include, among other things, the IP address of the server and, optionally, a protocol and port number (used for TCP/UDP port multiplexing—see below).

For servers that are not directly connected to the Pirus box, a route, most likely static, will need to be present; this route will contain either the IP address or IP subnet of the server that is NOT directly connected, with a gateway that is the IP address of the router interface that connects to the Pirus box, to be used as the next-hop to the server.

The IP subnet/mask pairs of the devices that make up the virtual domain should be configured. These subnet/mask pairs indirectly create a route table for the virtual domain. This allows the Pirus box to forward packets within a virtual domain, such as from content servers to application or database servers. A mask of 255.255.255.255 can be used to add a static host route to a particular device.

The Pirus box may be assigned an IP address from this subnet/mask pair. This IP address will be used in all IP and ARP packets authored by the Pirus box and sent to devices in the virtual domain. If an IP address is not assigned, all IP and ARP packets will contain a source IP address equal to one of the virtual IP addresses of the domain. FIG. 29 is illustrative. In FIG. 29, the Pirus box is designated by numeral 100. Also in FIG. 29 (the syntax for a port is <slot number>.<port number>), ports 1.3, 2.3, 3.3, 4.3, 5.1 and 5.3 are part of the same virtual domain. Server 1.1.1.1 may need to send packets to Cache 1.1.1.100. Even though the Cache may not be explicitly configured as part of the virtual domain, configuring the virtual domain with an IP subnet/mask of 1.1.1.0/255.255.255.0 will allow the servers to communicate with the cache. Server 1.1.1.1 may also need to send packets to Cache 192.168.1.100. Since this IP subnet is outside the scope of the virtual domain (i.e., the cache, and therefore the IP address, may be owned by the ISP), a static host route can be added for this one particular device.

2.1 Network Address Translation

In one practice of the invention, Network Address Translation, or NAT, is performed on packets sent to or from a virtual IP address. In FIG. 29 above, a client connected to the Internet will send a packet to a virtual IP address representing a virtual domain. The load-balancing function will select a physical server to send the packet to. NAT results in the destination IP address (and possibly the destination TCP/UDP port, if port multiplexing is being used) being changed to that of the physical server. The response packet from the server also has NAT performed on it to change the source IP address (and possibly the source TCP/UDP port) to that of the virtual domain.
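The two rewrites, forward-path and reverse-path, could be sketched as below. The dict-based packet representation and function names are assumptions for illustration only.

```python
def nat_to_server(pkt, server_ip, server_port=None):
    """Forward path: client -> virtual IP is rewritten to client -> chosen server."""
    pkt["dst_ip"] = server_ip
    if server_port is not None:                 # TCP/UDP port multiplexing case
        pkt["dst_port"] = server_port
    return pkt

def nat_to_client(pkt, virtual_ip, virtual_port=None):
    """Reverse path: server -> client is rewritten to virtual IP -> client,
    so the client only ever sees the virtual domain's address."""
    pkt["src_ip"] = virtual_ip
    if virtual_port is not None:
        pkt["src_port"] = virtual_port
    return pkt
```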

NAT is also performed when a load-balanceable server sends a request that also passes through the load-balancing function, such as an NFS request. In this case, the server assumes the role of a client.

3. VLAN Definition

It is contemplated that since the Pirus box will have many physical ports, the Virtual LAN (VLAN) concept will be supported. Ports that connect to servers and upstream routers will be grouped into their own VLAN, and the VLAN will be added to the configuration of a virtual domain.

In one practice of the invention, a virtual domain will be configured with exactly one VLAN. Although the server farms comprising the virtual domain may belong to multiple subnets, the Pirus box will not be routing (in a traditional sense) between the subnets, but will be performing a form of L3 switching. Unlike today's L3 switch/routers that switch frames within a VLAN at Layer 2 and route packets between VLANs at Layer 3, the Pirus box will switch packets using a combination of Layer 2 and Layer 3 information. It is expected that the complexity of routing between multiple VLANs will be avoided.

By default, packets received on all ports in the VLAN of a virtual domain are candidates for load balancing. On Router ports (see 4.4.1, Router Port), these packets are usually HTTP or FTP requests. On Server ports (see 4.4.2, Server Port), these packets are usually back-end server requests, such as NFS.

All packets received by the Pirus box are classified to a VLAN and are, hence, associated with a virtual domain. In some cases, this classification may be ambiguous because, with certain constraints, a physical port may belong to more than one VLAN. These constraints are discussed below.

3.1 Default VLAN

In one practice of the invention, by default, every port will be assigned to the Default VLAN. All non-IP packets received by the Pirus box are classified to the Default VLAN. If a port is removed from the Default VLAN, non-IP packets received on that port are discarded, and non-IP packets received on other ports will not be sent on that port. In accordance with this practice of the invention, all non-IP packets will be handled in the slow path. The packets will be forwarded to a single CPU determined by an election process, and this CPU will need to build and maintain MAC address tables to avoid flooding all received packets on the Default VLAN. This avoids having to copy (potentially large) forwarding tables between slots but may result in each packet traversing the switch fabric twice.

3.2 Server Administration VLAN

Devices connected to ports on the Server Administration VLAN can manage the physical servers in any virtual domain. By providing only this form of inter-VLAN routing, the system can avoid having to add Server Administration ports (see below) to the VLANs of every virtual domain that the server administration stations will manage.

3.3 Server Access VLAN

A Server Access VLAN is used internally between Pirus boxes. A Pirus box can make a load-balancing decision to send a packet to a physical server that is connected to another Pirus box. The packet will be sent on a Server Access VLAN that, unlike packets received on Router ports, may directly address physical servers. See the discussion of Load Balancing elsewhere herein for additional information on how this is used.

3.4 Port Types

3.4.1 Router Port

In one embodiment of the invention, one or more Router ports will be added to the VLAN configuration of a virtual domain. Note that a Router port is likely to be carrying traffic for many virtual domains.

Classifying a packet received on a Router port to a VLAN of a virtual domain is done by matching the destination IP address to one of the virtual IP addresses of the configured virtual domains.

ARP requests sent by the Pirus box to determine the MAC address and physical port of the servers that are configured as part of a virtual domain are not sent out Router ports. If a server is connected to the same port as an upstream router, the port must be configured as a Combo port (see below).

3.4.2 Server Port

Server ports connect to the servers that provide the content for a virtual domain. A Server port will most likely be connected to a single server, although it may be connected to multiple servers.

Classifying a packet received on a Server port to a VLAN of a virtual domain may require a number of steps:

-   1. using the VLAN of the port if the port is part of a single VLAN
-   2. matching the destination IP address and TCP/UDP port number to the source of a flow (i.e., an HTTP response)
-   3. matching the destination IP address to one of the virtual IP addresses of the configured virtual domains (i.e., an NFS request)

The default and preferred configuration is for a Server port to be a member of a single VLAN. However, multiple servers, physical or logical, may be connected to the same port and be in different VLANs only if the packets received on that port can unambiguously be associated with one of the VLANs on that port.

One way to do this is to use different IP subnets for all devices on the VLANs that the port connects to. TCP/UDP port multiplexing is often configured with a single IP address on a server and multiple TCP/UDP ports, one per virtual domain. It is preferable to also use a different IP address with each TCP/UDP port, but this is necessary only if the single server needs to send packets with TCP/UDP ports other than the ones configured on the Pirus box.

In FIG. 30, the physical server with IP address 1.1.1.4 provides HTTP content for two virtual domains, www.larry.com and www.curly.com. TCP/UDP port multiplexing is used to allow the same server to provide content for both virtual domains. When the Pirus box load balances packets to this server, it will use NAT to translate the destination IP address to 1.1.1.4 and the TCP port to 8001 for packets sent to www.larry.com and 8002 for packets sent to www.curly.com.

Packets sent from this server with a source TCP port of 8001 or 8002 can be classified to the appropriate domain. But if the server needs to send packets with other source ports (i.e., if it needs to perform an NFS request), it is ambiguous as to which domain the packet should be mapped.

The list of physical servers that make up a domain may require significant configuration. The IP addresses of each must be entered as part of the domain. To minimize the amount of information that the administrator must provide, the Pirus box determines the physical port that connects to a server, as well as its MAC address, by issuing ARP requests to the IP addresses of the servers. The initial ARP requests are only sent out Server and Combo ports. The management software may allow the administrator to specify the physical port to which a server is attached. This restricts the ARP request used to obtain the MAC address to that port only.

A Server port may be connected to a router that sits between the Pirus box and a server farm. In this configuration, the VLAN of the virtual domain must be configured with a static route for the subnet of the server farm that points to the IP address of the router port connected to the Pirus box. This intermediate router needs a route back to the Pirus box as well (either a default route or a route to the virtual IP address(es) of the virtual domain(s) served by the server farm).

3.4.3 Combo Port

A Combo port, as defined herein, is connected to both upstream routers and servers. Packet VLAN classification first follows the rules for Router ports, then Server ports.

3.4.4 Server Administration Port

A Server Administration port is connected to nodes that administer servers. Unlike packets received on a Router port, packets received on a Server Administration port can be sent directly to servers. Packets can also be sent to virtual IP addresses in order to test the load-balancing function.

A Server Administration port may be assigned to a VLAN that is associated with a virtual domain, or it may be assigned to the Server Administration VLAN. The former is straightforward—the packets are forwarded only to servers that are part of the virtual domain. The latter case is more complicated, as the packets received on the Server Administration port can only be sent to a particular server if that server's IP address is unique among all server IP addresses known to the Pirus box. This uniqueness requirement also applies if the same server is in two different virtual domains with TCP/UDP port multiplexing.

3.4.5 Server Access Port

A Server Access port is similar to a trunk port on a conventional Layer 2 switch. It is used to connect to another Pirus box and carry “tagged” traffic for multiple VLANs. This allows one Pirus box to forward a packet to a server connected to another Pirus box.

The Pirus box will use the IEEE 802.1Q VLAN trunking format. A VLAN ID will be assigned to the VLAN that is associated with the virtual domain. This VLAN ID will be carried in the VLAN tag field of the 802.1Q header.

3.4.6 Example of VLAN

FIG. 30 is illustrative of a VLAN. Referring now to FIG. 30, the Pirus box, designated by the pear icon, is shown with 5 slots, each of which has 3 ports. The VLAN configuration is as follows (the syntax for a port is <slot number>.<port number>):

-   VLAN 1
    -   Server ports 1.1, 2.1, 3.1 and 4.3 (denoted in picture by a dotted line)
    -   Router port 4.1 (denoted in picture by a heavy solid line)
-   VLAN 2
    -   Server ports 1.2, 2.2, 3.2 and 4.3 (denoted in picture by a dashed line)
    -   Server Administration port 5.2
    -   Router port 4.1 (denoted in picture by a heavy solid line)
-   VLAN 3
    -   Server ports 1.3, 2.3, 3.3 and 4.3 (denoted in picture by a solid line)
    -   Server Administration port 5.3
    -   Router port 4.1 (denoted in picture by a heavy solid line)
-   Server Administration VLAN
    -   Server Administration port 5.1 (denoted in picture by wide area link)

An exemplary virtual domain configuration is as follows:

-   Virtual domain www.moe.com
    -   Virtual IP address 100.1.1.1
    -   VLAN 1
        -   Server 2.1.1.1
        -   Server 2.1.1.2
        -   Server 2.1.1.3
        -   Server 2.1.1.4
-   Virtual domain www.larry.com
    -   Virtual IP address 200.1.1.1
    -   VLAN 2
        -   Server 1.1.1.1
        -   Server 1.1.1.2
        -   Server 1.1.1.3
        -   Server 1.1.1.4 Port 8001
-   Virtual domain www.curly.com
    -   Virtual IP address 300.1.1.1
    -   VLAN 3
        -   Server 1.1.1.1
        -   Server 1.1.1.2
        -   Server 1.1.1.3
        -   Server 1.1.1.4 Port 8002

Domains www.larry.com and www.curly.com each have a VLAN containing 3 servers with the same IP addresses: 1.1.1.1, 1.1.1.2 and 1.1.1.3. This functionality allows different customers to have virtual domains with servers using their own private address space that doesn't need to be unique among all the servers known to the Pirus box. They also contain the same server with IP address 1.1.1.4. Note the Port number in the configuration. This is an example of TCP/UDP port multiplexing, where different domains can use the same server, each using a unique port number. Domain www.moe.com has servers in their own address space, although server 2.1.1.4 is connected to the same port (4.3) as server 1.1.1.4 shared by the other two domains.

The administration station connected to port 5.2 is used to administer the servers in the www.larry.com virtual domain, and the station connected to 5.3 is used to administer the servers in the www.curly.com domain. The administration station connected to port 5.1 can administer the servers in www.moe.com.

4. Filtering Function

The filtering function of an RNP performs filtering on packets received from an upstream router. This ensures that the physical servers downstream from the Pirus box are not accessed directly from clients connected to the Internet.

5. Forwarding Function

The Pirus box will track flows between IP devices. A flow is a bi-directional conversation between two connected IP devices; it is identified by a source IP address, source UDP/TCP port, destination IP address, and destination TCP/UDP port.

A single flow table will contain flow entries for each flow through the Pirus box. The forwarding entry content, creation, removal and use are discussed below.

5.1 Flow Entry Description

A flow entry describes a flow and the information necessary to reach the endpoints of the flow. A flow entry contains the following information:

-   Source IP address (4 bytes): Source IP address
-   Destination IP address (4 bytes): Destination IP address
-   Source TCP/UDP port (2 bytes): Source higher layer port
-   Destination TCP/UDP port (2 bytes): Destination higher layer port
-   Source physical port (2 bytes): Physical port of the source
-   Source next-hop MAC address (6 bytes): The MAC address of the next-hop to the source
-   Destination physical port (2 bytes): Physical port of the destination
-   Destination next-hop MAC address (6 bytes): MAC address of the next-hop to the destination
-   NAT IP address (4 bytes): Translation IP address
-   NAT TCP/UDP port (2 bytes): Translation higher layer port
-   Flags (2 bytes): Various flags
-   Received packets (2 bytes): No. of packets received from source IP address
-   Transmitted packets (2 bytes): No. of packets sent to the source IP address
-   Received bytes (4 bytes): No. of bytes received from source IP address
-   Transmitted bytes (4 bytes): No. of bytes sent to source IP address
-   Next pointer (receive path) (4 bytes): Pointer to next forwarding entry in the hash table used in the receive path
-   Next pointer (transmit path) (4 bytes): Pointer to next forwarding entry in the hash table used in the transmit path
-   Transmit path key (4 bytes): Smaller key unique among all flow entries
-   Total: 60 bytes
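For readability, the attribute list can be transcribed as a record as shown below. This is an illustrative Python rendering, not the actual packed structure; the two next-pointer fields and the transmit-path key are omitted here because, in this sketch, the surrounding hash tables manage that bookkeeping.

```python
from dataclasses import dataclass

@dataclass
class FlowEntry:
    """Illustrative transcription of the flow entry attributes listed above."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    src_phys_port: int
    src_next_hop_mac: str
    dst_phys_port: int
    dst_next_hop_mac: str
    nat_ip: str
    nat_port: int
    flags: int = 0
    rx_packets: int = 0
    tx_packets: int = 0
    rx_bytes: int = 0
    tx_bytes: int = 0
```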

In accordance with the invention, the IP addresses and TCP/UDP ports in a flow entry are relative to the direction of the flow. Therefore, a flow entry for a flow will be different in the flow tables that handle each direction. This means a flow will have 2 different flow entries, one on the NP that connects to the source of the flow and one on the NP that connects to the destination of the flow. If the same NP connects to both the source and destination, then that NP will contain 2 flow entries for the flow.

In one practice of the invention, on an RNP, the first four attributes uniquely identify a flow entry. The source and destination IP addresses are globally unique in this context since they both represent reachable Internet addresses.

On an SNP, the fifth attribute is also required to uniquely identify a flow entry. This is best described in connection with the example shown in FIG. 31. As shown therein, a mega-proxy, such as AOL, performs NAT on the source IP address and TCP/UDP port combinations from the clients that connect to it. Since a flow is defined by source and destination IP address and TCP/UDP port, the proxy can theoretically reuse the same source IP address and TCP/UDP port when communicating with different destinations. But when the Pirus box performs load balancing and NAT from the virtual IP address to a particular server, the destination IP addresses and TCP/UDP port of the packets may no longer be unique to a particular flow. Therefore, the virtual domain must be included in the comparison to find the flow entry. Requiring that the IP addresses reachable on a Server port be unique across all virtual domains on that port solves the problem. The flow entry lookup can also compare the source physical port of the flow entry with the physical port on which the packet was received.

A description of the attributes is as follows:

5.1.1 Source IP address: The source IP address of the packet. Source TCP/UDP port: The source TCP/UDP port number of the packet.

5.1.2 Destination IP address: The destination IP address of the packet.

5.1.3 Destination TCP/UDP port: The destination TCP/UDP port number of the packet.

5.1.4 Source physical port: The physical port on the Pirus box used to reach the source IP address.

5.1.5 Source next-hop MAC address: The MAC address of the next-hop to the source IP address. This MAC address is reachable out the source physical port and may be the host that owns the IP address.

5.1.6 Destination physical port: The physical port on the Pirus box used to reach the destination IP address.

5.1.7 Destination next-hop MAC address: The MAC address of the next-hop to the destination IP address. This MAC address is reachable out the destination physical port and may be the host that owns the IP address.

5.1.8 NAT IP address: The IP address that either the source or destination IP address must be translated to. If the source IP address in the flow entry represents the source of the flow, then this address replaces the destination IP address in the packet. If the source IP address in the flow entry represents the destination of the flow, then this address replaces the source IP address in the packet.

5.1.9 NAT TCP/UDP port: The TCP/UDP port that either the source or destination TCP/UDP port must be translated to. If the source TCP/UDP port in the flow entry represents the source of the flow, then this port replaces the destination TCP/UDP port in the packet. If the source TCP/UDP port in the flow entry represents the destination of the flow, then this port replaces the source TCP/UDP port in the packet.

5.1.10 Flags: Various flags can be used to denote whether the flow entry is relative to the source or destination of the flow, etc.

5.1.11 Received packets: The number of packets received with a source IP address and TCP/UDP port equal to that in the flow entry.

5.1.12 Transmitted packets: The number of packets transmitted with a destination IP address and TCP/UDP port equal to that in the flow entry.

5.1.13 Received bytes: The number of bytes received with a source IP address and TCP/UDP port equal to that in the flow entry.

5.1.14 Transmitted bytes: The number of bytes transmitted with a destination IP address and TCP/UDP port equal to that in the flow entry.

5.1.15 Next pointer (receive path): A pointer to the next flow entry in the linked list. It is assumed that a hash table will be used to store the flow entries. This pointer will be used to traverse the list of hash collisions in the hash done by the receive path (see below).

5.1.16 Next pointer (transmit path): A pointer to the next flow entry in the linked list. It is assumed that a hash table will be used to store the flow entries. This pointer will be used to traverse the list of hash collisions in the hash done by the transmit path (see below).

5.2 Adding Forwarding Entries

5.2.1 Client IP Addresses

A client IP address is identified as a source IP address in a packet that has a destination IP address that is part of a virtual domain. A flow entry is created for client IP addresses by the load-balancing function. A packet received on a Router or Server port is matched against the configured policies of a virtual domain. If a physical server is chosen to receive the packet, a flow entry is created with the following values:

-   Source IP address: the source IP address from the packet
-   Destination IP address: the destination IP address from the packet
-   Source TCP/UDP port: the source TCP/UDP port from the packet
-   Destination TCP/UDP port: the destination TCP/UDP port from the packet
-   Source physical port: the physical port on which the packet was received
-   Source next-hop MAC address: the source MAC address of the packet
-   Destination physical port: the physical port connected to the server
-   Destination next-hop MAC address: the MAC address of the server
-   NAT IP address: the IP address of the server chosen by the load-balancing function
-   NAT TCP/UDP port: the TCP/UDP port number of the chosen server. This may be different from the destination TCP/UDP port if port multiplexing is used
-   Flags: Can be determined

In one practice of the invention, the flow entry will be added to two hash tables. One hash table is used to look up a flow entry given values in a packet received via a network interface. The other hash table is used to look up a flow entry given values in a packet received via the switch fabric. Both hash table index values will most likely be based on the source and destination IP addresses and TCP/UDP port numbers.
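A sketch of the dual insertion is shown below. The exact composition of the two keys is an assumption (the text says only that both indices will most likely be based on the addresses and ports), and Python dicts stand in for the real hash tables.

```python
def flow_keys(entry):
    """Assumed keys for the two lookups described above."""
    rx_key = (entry.src_ip, entry.src_port, entry.dst_ip, entry.dst_port)
    tx_key = (entry.nat_ip, entry.nat_port, entry.src_ip, entry.src_port)
    return rx_key, tx_key

def install_flow(entry, rx_table, tx_table):
    """Insert one flow entry into both tables: one indexed for packets arriving
    from the network interface, one for packets arriving from the switch fabric."""
    rx_key, tx_key = flow_keys(entry)
    rx_table[rx_key] = entry
    tx_table[tx_key] = entry
```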

In accordance with the invention, if the packet of the new flow is received on a Router port, then the newly created forwarding entry needs to be sent to the NPs of all other Router ports. The NP connected to the flow destination (most likely a Server port; could it be a Router port?) will rewrite the flow entry from the perspective of packets received on that port that will be sent to the source of the flow:

-   Source IP address: original NAT IP address
-   Destination IP address: original source IP address
-   Source TCP/UDP port: original NAT TCP/UDP port
-   Destination TCP/UDP port: original source TCP/UDP port
-   Source physical port: original destination physical port
-   Source next-hop MAC address: original destination MAC address
-   Destination physical port: original source physical port
-   Destination next-hop MAC address: original source MAC address
-   NAT IP address: original destination IP address
-   NAT TCP/UDP port: original destination TCP/UDP port
-   Flags: Can be determined

5.2.2 Virtual Domain IP Addresses

Virtual domain IP addresses are those that identify the domain (such as www.fred.com) and are visible to the Internet. The “next hop” of these IP addresses is the load balancing function. In one practice of the invention, addition of these IP addresses is performed by the management software when the configuration is read.

Attribute                         Value
IP address                        the virtual IP address
TCP/UDP port                      zero if the servers in the virtual domain accept all TCP/UDP port numbers; otherwise, a separate forwarding entry will exist for each TCP/UDP port number that is supported
Destination IP address            zero
Destination TCP/UDP port          zero
Physical port                     n/a
Next-hop MAC address              n/a
Server IP address                 n/a
Server TCP/UDP port               n/a
Server physical port              n/a
Flags                             an indicator that packets destined to this IP address and TCP/UDP port are to be load-balanced

5.2.3 Server IP Addresses

Server IP addresses are added to the forwarding table by the management software when the configuration is read.

The forwarding function will periodically issue ARP requests for the IP address of each physical server. Exactly how the physical servers are known, whether by manual configuration or dynamic learning, is beyond the scope of the IFF function. In any case, since the administrator shouldn't have to specify the port that connects to the physical servers, the Pirus box must determine it. ARP requests will need to be sent out every port connected to an SNP until an ARP response is received from a server on a port. Once a server's IP address has been resolved, periodic ARP requests to ensure the server is still alive can be sent out the learned port. A forwarding entry will be created once an ARP response is received. A forwarding entry will be removed (or marked invalid) once an entry times out.

If the ARP information for the server times out, subsequent ARP requests will again need to be sent out all SNP ports. An exponential backoff time can be used so that servers that are turned off will not result in significant bandwidth usage.

For servers connected to the Pirus box via a router, ARP requests will be issued for the IP address of the router interface.
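The probing policy above can be summarized by the following sketch, in which the port count, the timer values, and the send_arp_request() helper are all hypothetical; the real forwarding function would drive this from its own timers and would create or invalidate forwarding entries on ARP replies and timeouts.

#include <stdint.h>

#define MAX_SNP_PORTS   16        /* assumed number of server-facing ports */
#define MIN_RETRY_SECS  1
#define MAX_RETRY_SECS  512       /* cap for the exponential backoff       */

/* Hypothetical per-server discovery state. */
struct server_probe {
    uint32_t ip;                  /* server (or router interface) IP address */
    int      learned_port;        /* -1 until an ARP reply has been seen     */
    unsigned retry_secs;          /* current backoff interval                */
};

/* Placeholder: the real code would build and transmit an ARP request. */
static void send_arp_request(int port, uint32_t ip) { (void)port; (void)ip; }

/* Called when the server's retry timer expires. */
static void probe_server(struct server_probe *s)
{
    if (s->learned_port >= 0) {
        /* Already resolved: a keep-alive probe on the learned port
         * verifies that the server is still alive. */
        send_arp_request(s->learned_port, s->ip);
        s->retry_secs = MIN_RETRY_SECS;
        return;
    }

    /* Location unknown: probe every SNP port, then back off exponentially
     * so a powered-off server does not consume significant bandwidth. */
    for (int port = 0; port < MAX_SNP_PORTS; port++)
        send_arp_request(port, s->ip);

    if (s->retry_secs < MAX_RETRY_SECS)
        s->retry_secs *= 2;
}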

Attribute                         Value
IP address                        the server's IP address
TCP/UDP port                      TBD
Destination IP address            zero
Destination TCP/UDP port          zero
Physical port                     n/a
Server IP address                 n/a
Server TCP/UDP port               n/a
Server physical port              n/a
Flags                             TBD

5.3 Distributing the Forwarding Table

In one practice of the invention, as physical servers are located, their IP address/port combinations will be distributed to all RNPs. Likewise, as upstream routers are located, their IP address/MAC address/port combinations will be distributed to all SNPs.

5.4 Ingress Function

It is assumed that the Ethernet frame passes the CRC check before the packet reaches the forwarding function and that frames that don't pass the CRC check are discarded. As it is anticipated that the RNP will be heavily loaded, the IP and TCP/UDP checksum validation can be performed by the SNP. Although it is probably not useful to perform the forwarding function if the packet is corrupted, because the data used by those functions may be invalid, the process should still work.

After the load balancing function has determined a physical server that should receive the packet, the forwarding function performs a lookup on the IP address of the server. If an entry is found, this forwarding table entry contains the port number that is connected to the server, and the packet is forwarded to that port. If no entry is found, the packet is discarded. The load balancing function should never choose a physical server whose location is unknown to the Pirus box.

For packets received from a server, the forwarding function performs a lookup on the IP address of the upstream router. If an entry is found, the packet is forwarded to the port contained in the forwarding entry.

The ingress function in the RNP calls the load balancing function and is returned the following (any value of zero implies that the old value should be used):

-   1. new destination IP address
-   2. new destination port

The RNP will optionally perform Network Address Translation, or NAT, on the packets that arrive from the upstream router. This is because the packets from the client have a destination IP address of the domain (i.e., www.fred.com). The new destination IP address of the packet is that of the actual server that was chosen by the load balancing function. In addition, a new destination port may be chosen if TCP/UDP port multiplexing is in use. Port multiplexing may be used on the physical servers in order to conserve IP addresses. A single server may serve multiple domains, each with a different TCP/UDP port number.

The SNP will optionally perform NAT on the packets that arrive from a server. This is because there may be a desire to hide the details of the physical servers that provide the load balancing function and have it appear as if the domain IP address is the “server”. The new source of the packet is that of the domain. As the domain may have multiple IP addresses, the Pirus box needs a client table that maps the client's IP address and TCP/UDP port to the domain IP address and port to which the client sent the original packet.
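A minimal sketch of the two rewrites follows, assuming hypothetical packet-header and binding structures; checksum maintenance is deliberately omitted here, since it is addressed by the egress function described in Section 6 below.

#include <stdint.h>

/* Illustrative header fields only; the real code operates on the packet
 * buffer handed to the network processor. */
struct pkt_hdrs {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

struct nat_binding {
    uint32_t nat_ip;       /* physical server chosen by load balancing   */
    uint16_t nat_port;     /* chosen server port, if port multiplexing   */
};

/* Client -> server direction: the virtual domain address is replaced by
 * the address of the server selected by the load-balancing function.   */
static void nat_rewrite_ingress(struct pkt_hdrs *p, const struct nat_binding *b)
{
    p->dst_ip = b->nat_ip;
    if (b->nat_port != 0)          /* zero means "keep the old value"    */
        p->dst_port = b->nat_port;
}

/* Server -> client direction: the source is rewritten back to the domain
 * IP address and port the client originally used (from the client table). */
static void nat_rewrite_egress(struct pkt_hdrs *p,
                               uint32_t domain_ip, uint16_t domain_port)
{
    p->src_ip = domain_ip;
    p->src_port = domain_port;
}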

6. Egress Function

Packets received from an upstream router will be forwarded to a server. The forwarding function sends the packet to the SNP providing support for the server. This SNP performs the egress function to do the following:

-   1. verify the IP checksum
-   2. verify the TCP or UDP checksum
-   3. change the destination port to that of the server (as determined by the load balancing function call in the ingress function)
-   4. change the destination IP address to that of the server (as determined by the load balancing function call in the ingress function)
-   5. recalculate the TCP or UDP checksum if the destination port or destination IP address was changed
-   6. recalculate the IP header checksum if the destination IP address was changed
-   7. set the destination MAC address to that of the server or next-hop to the server (as determined by the forwarding function)
-   8. recalculate the Ethernet packet CRC if the destination port or destination IP address was changed

Packets received from a server will be forwarded to an upstream router. The SNP performs the egress function to do the following:

-   1. verify the IP checksum
-   2. verify the TCP or UDP checksum
-   3. change the source port to the one that the client sent the request to (as determined by the ingress function client table lookup)
-   4. change the source IP address to the one that the client sent the request to (as determined by the ingress function client table lookup)
-   5. recalculate the TCP or UDP checksum if the source port or source IP address was changed
-   6. recalculate the IP header checksum if the source IP address was changed
-   7. set the destination MAC address to that of the upstream router
-   8. recalculate the Ethernet packet CRC if the source port or source IP address was changed
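Steps 5 and 6 in the lists above require the checksums to be recalculated after the address or port rewrite. One straightforward (non-incremental) way to recompute the IPv4 header checksum is the standard ones'-complement sum sketched below; the TCP or UDP checksum must likewise be recomputed (or incrementally adjusted per RFC 1624), because it covers a pseudo-header containing the rewritten IP addresses. This is a generic sketch, not the Pirus data-path code.

#include <stdint.h>
#include <stddef.h>

/* Full recomputation of the IPv4 header checksum (RFC 791 / RFC 1071
 * style ones'-complement sum).  The caller zeroes the checksum field,
 * calls this over the header, and stores the result back. */
static uint16_t ip_header_checksum(const void *hdr, size_t len)
{
    const uint16_t *word = hdr;
    uint32_t sum = 0;

    while (len > 1) {              /* sum the header as 16-bit words     */
        sum += *word++;
        len -= 2;
    }
    if (len)                       /* odd header length is not expected  */
        sum += *(const uint8_t *)word;

    while (sum >> 16)              /* fold carries into the low 16 bits  */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;         /* ones'-complement of the sum        */
}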

V. IP-Based Storage Management—Device Discovery & Monitoring

In data networks based on IP/Ethernet technology, a set of standards has developed that permits users to manage and operate their networks using a heterogeneous collection of hardware and software. These standards include Ethernet, Internet Protocol (IP), Internet Control Message Protocol (ICMP), Management Information Base (MIB) and Simple Network Management Protocol (SNMP). Network Management Systems (NMS) such as HP OpenView utilize these standards to discover and monitor network devices.

Storage Area Networks (SANs) use a completely different set of technology based on Fibre Channel (FC) to build and manage “Storage Networks”. This has led to a “re-inventing of the wheel” in many cases. Also, SAN devices do not integrate well with existing IP-based management systems.

Lastly, the storage devices (disks, RAID arrays, etc.), which are Fibre Channel attached to the SAN devices, do not support IP (and the SAN devices have limited IP support), and the storage devices cannot be discovered/managed by IP-based management systems. There are essentially two sets of management products: one for the IP devices and one for the storage devices.

A trend is developing where storage networks and IP networks are converging to a single network based on IP. However, conventional IP-based management systems cannot discover FC attached storage devices.

The following discussion explains a solution to this problem, in two parts. The first aspect is device discovery; the second is device monitoring.

Device Discovery

FIG. 32 illustrates device discovery in accordance with the invention. In the illustrated configuration the NMS cannot discover (“see”) the disks attached to the FC Switch, but it can discover (“see”) the disks attached to the Pirus System. This is because the Pirus System does the following:

-   Assigns an IP address to each disk attached to it.
-   Creates an Address Resolution Protocol (ARP) table entry for each disk. This is a simple table that contains a mapping between IP and physical addresses.
-   When the NMS uses SNMP to query the Pirus System, the Pirus System will return an ARP entry for each disk attached to it.
-   The NMS will then “ping” (send ICMP echo request) for each ARP entry it receives from the Pirus System.
-   The Pirus System will intercept the ICMP echo requests destined for the disks and translate the ICMP echo into a SCSI Read Block 0 request and send it to the disk.
-   If the SCSI Read Block 0 request successfully completes then the Pirus System acknowledges the “ping” by sending back an ICMP echo reply to the NMS.
-   If the SCSI Read Block 0 request fails then the Pirus System will not respond to the “ping” request.

The end result of these actions is that the NMS will learn about the existence of each disk attached to the Pirus System and verify that it can reach it. The NMS has now discovered the device.
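In outline, the intercept behaves as sketched below; scsi_read_block0() and send_icmp_echo_reply() are hypothetical stand-ins for the Pirus System's SCSI and ICMP services.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers assumed to exist elsewhere in the system. */
bool scsi_read_block0(uint32_t disk_ip);                       /* SCSI Read of LBA 0 */
void send_icmp_echo_reply(uint32_t dst_ip, const void *req, int len);

/* Invoked when an ICMP echo request arrives for an IP address that the
 * Pirus System assigned to a Fibre Channel attached disk. */
void handle_ping_for_disk(uint32_t nms_ip, uint32_t disk_ip,
                          const void *icmp_req, int len)
{
    /* Translate the ping into a SCSI Read Block 0 request to the disk. */
    if (scsi_read_block0(disk_ip))
        send_icmp_echo_reply(nms_ip, icmp_req, len);   /* disk is reachable */
    /* On failure the ping is simply not answered, so the NMS treats the
     * device as unreachable. */
}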

Device Monitoring

Once the device (disk) has been discovered by the NMS, it will start sending it SNMP requests to learn what the device can do (i.e., determine its level of functionality). The Pirus System will intercept these SNMP requests and generate a SCSI request to the device. The response to the SCSI request will be converted back into an SNMP reply and returned to the NMS. FIG. 33 illustrates this.

The configuration illustrated in FIG. 33 is essentially an SNMP <-> SCSI converter/translator.

Lastly, NMS can receive asynchronous events (traps) from devices. These are notifications of events that may or may not need attention. The Pirus System will also translate SCSI exceptions into SNMP traps, which are then propagated to the NMS. FIG. 34 illustrates this.

VI. DATA STRUCTURE LAYOUT

Data Structure Layout: FIG. 35 shows the relationships between the various configuration data structures. Each data structure is described in detail following the diagram. The data structures are not linked; however, the interconnecting lines in the diagram display references from one data structure to another. These references are via instance number.

Data Structure Descriptions

-   1. VSD_CFG_T: This data structure describes a Virtual Storage Domain. Typically there is a single VSD for each end user customer of the box. A VSD has references to VLANs that provide information on ports allowed access to the VSD. VSE structures provide information for the storage available to a VSD and SERVER_CFG_T structures provide information on CPUs available to a VSD. A given VSD may have multiple VSE and SERVER structures.
-   2. VSE_CFG_T: This data structure describes a Virtual Storage Endpoint. VSEs can be used to represent Virtual Servers (NAS) or IP-accessible storage (iSCSI, SCSI over UDP, etc.). They are always associated with one, and only one, VSD.
-   3. VlanConfig: This data structure is used to associate a VLAN with a VSD. It is not used to create a VLAN.
-   4. SERVER_CFG_T: This data structure provides information regarding a single CPU. It is used to attach CPUs to VSEs and VSDs. For replicated NFS servers there can be more than one of these data structures associated with a given VSE.
-   5. MED_TARG_CFG_T: This data structure represents the endpoint for Mediation Target configuration: a device on the Fibre Channel connected to the Pirus box being accessed via some form of SCSI over IP.
-   6. LUN_MAP_CFG_T: This data structure is used for mapping Mediation Initiator access. It maps a LUN on the specified Pirus FC port to an IP/LUN pair on a remote iSCSI target.
-   7. FILESYS_CFG_T: This data structure is used to represent a file system on an individual server. There may be more than one of these associated with a given server. If this file system will be part of a replicated NFS file system, the filesystem_id and the mount point will be the same for each of the file systems in the replica set.
-   8. SHARE_CFG_T: This data structure is used to provide information regarding how a particular file system is being shared. The information in this data structure is used to populate the sharetab file on the individual server CPUs.
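Purely as an illustration of the instance-number cross-referencing described above, a few of these structures might be sketched as follows; the field names and sizes are assumptions, and the real structures carry far more configuration state.

#include <stdint.h>

/* Illustrative sketches only; references between structures are by
 * instance number, matching the description of FIG. 35. */
typedef struct {
    uint32_t vsd_instance;        /* this Virtual Storage Domain        */
    uint32_t vlan_instances[8];   /* VLANs granted access to the VSD    */
} VSD_CFG_SKETCH_T;

typedef struct {
    uint32_t vse_instance;        /* this Virtual Storage Endpoint      */
    uint32_t vsd_instance;        /* owning VSD (always exactly one)    */
    uint32_t type;                /* e.g. NAS virtual server or iSCSI   */
} VSE_CFG_SKETCH_T;

typedef struct {
    uint32_t server_instance;     /* a single CPU                       */
    uint32_t vse_instance;        /* VSE this CPU is attached to        */
} SERVER_CFG_SKETCH_T;

typedef struct {
    uint32_t filesys_instance;
    uint32_t server_instance;     /* server that hosts this filesystem  */
    uint32_t filesystem_id;       /* shared across a replica set        */
    char     mount_point[64];
} FILESYS_CFG_SKETCH_T;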

EXAMPLES Server Health

-   1) Listen for VSD_CFG_T. When get one, create local VSD structure.
-   2) Listen for VSE_CFG_T. When get one, wire to local VSD.
-   3) Listen for SERVER_CFG_T. When get one, wire to local VSE.
-   4) Start Server Health for server.
-   5) Listen for FILESYS_CFG_T. When get one, wire to local SERVER/VSE.
-   6) Start Server Health read/write to file system.
-   7) Listen for MED_SE_CFG_T. When get one, wire to local VSE.
-   8) Start Server Health pings on IP specified in VSE referenced by MED_SE_CFG_T.

Mediation Target

-   1) Listen for VSE_CFG_T. When get one with type of MED, create local VSE structure.
-   2) Listen for MED_SE_CFG_T. When get one, wire to local VSE.
-   3) Setup mediation mapping based on information provided in VSE/MED_SE pair.

Mediation Initiator

-   1) Listen for LUN_MAP_CFG_T. When get one, request associated SERVER_CFG_T from MIC.
-   2) Create local SERVER structure.
-   3) Add information from LUN_MAP_CFG_T to LUN map for that server.

NCM

-   1) Listen for SHARE_CFG_T with a type of NFS.
-   2) Request associated FILESYS_CFG_T from MIC.
-   3) If existing filesystem_id, add to set. If new, create new replica set.
-   4) Bring new file system up to date. When finished, send FILESYS_CFG_T with state of “ONLINE”.

The above features of the Pirus System allow storage devices attached to a Pirus System to be discovered and managed by an IP-based NMS. This lets users apply standards-based, widely deployed systems that manage IP data networks to the management of storage devices, something currently not possible.

Accordingly, the Pirus System permits the integration of storage devices (non-IP devices, e.g., disks) into IP-based management systems (e.g., an NMS), and thus provides unique features and functionality.

VII. NAS Mirroring and Content Distribution

The following section describes techniques and subsystems for providing mirrored storage content to external NAS clients in accordance with the invention.

The Pirus SRC NAS subsystem described herein provides dynamically distributed, mirrored storage content to external NAS clients, as illustrated in FIG. 36. These features provide storage performance scalability and increased availability to users of the Pirus system. The following describes the design of the SRC NAS content distribution subsystem as it pertains to NAS servers and NAS management processes. Load Balancing operations are described elsewhere in this document.

1. Content Distribution and Mirroring

1.1 Mirror Initialization via NAS

After volume and filesystem initialization, a complete copy of a filesystem can be established using the normal NAS facilities (create and write) and the maintenance procedures described hereinafter. A current filesystem server set is in effect immediately after filesystem creation using this method.

1.2 Mirror Initialization via NDMP

A complete filesystem copy can also be initialized via NDMP. Since NDMP is a TCP based protocol and TCP based load balancing is not initially supported, the 2nd and subsequent members of a NAS peer set must be explicitly initialized. This can be done with additional NDMP operations. It can also be accomplished by the filesystem synchronization facilities described herein. Once initialization is complete, a current filesystem server set is in effect.

1.3 Sparse Content Distribution

Partial filesystem content replication can also be supported. Sparse copies of a filesystem will be dynamically maintained in response to IFF and MIC requests. The details of MIC and IXP interaction can be left to implementers, but the concept of sparse filesystems and their maintenance is discussed herein.

2. NCM

The NCM (NAS Coherency Manager) is used to maintain file handle synchronization, manage content distribution, and coordinate filesystem (re)construction. The NCM runs primarily on an SRC's 9th processor, with agents executing on LIC IXPs and SRC 750's within the chassis. Inter-chassis NAS replication is beyond the scope of this document.

2.1 NCM Objectives

One of the primary goals of the NCM is to minimize the impact of mirrored content service delivery upon individual NAS servers. NAS servers within the Pirus chassis will operate as independent peers while the NCM manages synchronization issues “behind the scenes.”

The NCM will be aware of all members in a Configured Filesystem Server Set. Individual NAS servers do not have this responsibility.

The NCM will resynchronize NAS servers that have fallen out of sync with the Configured Filesystem Server Set, whether due to transient failure, hard failure, or new extension of an existing group.

The NCM will be responsible for executing content re-distribution requests made by IFF load balancers when sparse filesystem copies are supported. The NCM will provide Allocated Inode and Content Inode lists to IFF load balancers.

The NCM will be responsible for executing content re-distribution requests made by the MIC when sparse filesystem copies are supported. Note that rules should exist for run-time contradictions between IXP and MIC balancing requests.

The NCM will declare NAS server “life” to interested parties in the chassis and accept “death notices” from server health related services.

2.2 NCM Architecture

2.3 NCM Processes and Locations

The NCM has components executing at several places in the Pirus chassis.

-   The primary NCM service executes on an SRC 9th processor.
-   An NCM agent runs on each SRC 750 CPU that is loaded for NAS.
-   An NCM agent runs on each IXP that is participating in a VSD.
-   A Backup NCM process will run on a 2nd SRC's 9th processor. If the primary NCM becomes unavailable for any reason, the secondary NCM will assume its role.

2.4 NCM and IPC Services

The NCM will use the Pirus IPC subsystem to communicate with IFF and NAS server processors.

The NCM will receive any and all server health declarations, as well as any IFF initiated server death announcement. The NCM will announce server life to all interested parties via IPC.

Multicast IPC messages should be used by NCM agents when communicating with the NCM service. This allows the secondary NCM to remain synchronized and results in less disruptive failover transitions.

After chassis initialization, the MIC configuration system will inform the NCM of all Configured Filesystem Server Sets via IPC. Any user configured changes to Filesystem Server Sets will be relayed to the NCM via IPC.

NCM will make requests of NCM agents via IPC and accept their requests as well.

2.5 NCM and Inode Management

All file handles (inodes) in a Current Filesystem Server Set should have identical interpretation.

The NCM will query each member of a Configured Filesystem Server Set for InodeList-Allocated and InodeList-Content after initialization and after synchronization. The NCM may periodically repeat this request for verification purposes.

Each NAS server is responsible for maintaining these 2 file handle usage maps on a per filesystem basis. One map, IN-Alloc, represents all allocated inodes on a server. The 2nd usage map, IN-Content, represents all inodes with actual content present on the server. On servers where full n-way mirroring is enabled, the 2 maps will be identical. On servers using content sensitive mirroring, the 2nd “content” map will be a subset of the first. Usage maps will have a global filesystem checkpoint value associated with them.
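One plausible representation of the two usage maps is a pair of per-filesystem bitmaps indexed by inode number and tagged with the checkpoint value they were taken against, as sketched below; the sizes and helper names are assumptions.

#include <stdint.h>

#define MAX_INODES   (1u << 20)           /* assumed per-filesystem limit */
#define BITMAP_WORDS (MAX_INODES / 32)

/* Illustrative usage maps for one filesystem on one NAS server. */
struct inode_usage_maps {
    uint32_t checkpoint;                  /* global filesystem checkpoint */
    uint32_t in_alloc[BITMAP_WORDS];      /* IN-Alloc: allocated inodes   */
    uint32_t in_content[BITMAP_WORDS];    /* IN-Content: content present  */
};

static void mark(uint32_t *map, uint32_t ino)
{
    map[ino / 32] |= 1u << (ino % 32);
}

/* Full n-way mirroring keeps the two maps identical; content-sensitive
 * mirroring only sets IN-Content for inodes whose data is stored locally. */
static void note_inode(struct inode_usage_maps *m, uint32_t ino, int has_content)
{
    mark(m->in_alloc, ino);
    if (has_content)
        mark(m->in_content, ino);
}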

2.6 Inode Allocation Synchronization

All peer NAS servers must maintain identical file system and file handle allocations.

All inode creation and destruction operations must be multicast from the IXP/IFF source to an entire active filesystem server set. These multicast packets must also contain a sequence number that uniquely identifies the transaction on a per IXP basis.

Inode creation and destruction will be serialized within individual NAS servers.

2.7 Inode Inconsistency Identification

When an inode is allocated, deallocated or modified, the multicasting IXP must track the outstanding request and report any inconsistency or timeout to the NCM as a NAS server failure.

When all members of a current filesystem server set time out on a single request, the IXP must consider that the failure is one of the following events:

-   IXP switch fabric multicast transmission error
-   Bogus client request
-   Simultaneous current filesystem server set fatality

The 3rd item is least likely and should only be assumed when the first 2 bullets can be ruled out.

NAS servers must track the incoming multicast sequence number provided by the IXP in order to detect erroneous transactions as soon as possible. If a NAS server detects a missing or out-of-order multicast sequence number, it must negotiate its own death with the NCM. If all members of a current filesystem server set detect the same missing sequence number, then the negotiation fails and the current filesystem server set should remain active.

When an inconsistency is identified, the offending NAS server will be reset and rebooted. The NCM is responsible for initiating this process. It may be possible to gather some “pre-mortem” information and possibly even undo final erroneous inode allocations prior to rebooting.

3. Filesystem Server Sets

3.1 Types

For a given filesystem, there are 3 filesystem server sets that pertain to it: configured, current and joining.

As described in the definition section, the configured filesystem server set is the set of CPUs that the user specified to serve a copy of the particular filesystem. To make a filesystem ready for service, a current filesystem server set must be created. As servers present themselves and their copy of the filesystem to the NCM and are determined to be part of the configured server set, the NCM must reconcile their checkpoint value for the filesystem with either the current set's checkpoint value or the checkpoint value of joining servers in the case where a current filesystem server set does not yet exist.

A current filesystem server set is a dynamic grouping of servers that is identified by a filesystem id and a checkpoint value. The current filesystem server set for a filesystem is created and maintained by the NCM. The joining server set is simply the set of NAS servers that are attempting to be part of the current server set.

3.2 States of the Current Server Set

A current filesystem server set can be active, inactive, or paused. When it is active, NFS requests associated with the filesystem id are being forwarded from the IXPs to the members of the set. When the set is inactive, the IXPs are dropping NFS requests to the server set. When the set is paused, the IXPs are queuing NFS requests destined for the set.

When a current filesystem server set becomes active and is serving clients and a new server wishes to join the set, we must at least pause the set to prevent updates to the copies of the filesystem during the join operation. The benefit of a successful pause and continue versus deactivate and activate is that NFS clients may not need to retransmit requests that were sent while the new server was joining. There clearly are limits to how many NFS client requests you can queue before you are forced to drop. Functionally both work. A first pass could leave out the pause and continue operations until later.

4. Description of Operations on a Current Filesystem Server Set

During the lifetime of a current filesystem server set, several items of information must, for recovery purposes, be kept somewhere an NCM can find them after a fault.

4.1 Create_Current_Filesystem_Server_Set(fsid, slots/cpus)

Given a set of CPUs that are up, configured to serve the filesystem, and wishing to join, the NCM must decide which server has the latest copy of the filesystem, and then synchronize the other joining members with that copy.

4.2 Add_Member_To_Current_Filesystem_Server_Set(fsid, slot/cpu)

Given a CPU that wishes to join, the NCM must synchronize that CPU's copy of the filesystem with the copy being used by the current filesystem server set.

4.3 Checkpoint_Current_Filesystem_Server_Set(fsid)

Since a filesystem's state is represented by its checkpoint value and modified InodeLists, and the time to recover a filesystem with the same checkpoint value is a function of the modifications represented by the modified InodeList, it is desirable to checkpoint the filesystem regularly. The NCM will coordinate this. A new checkpoint value will then be associated with the copies served by the current filesystem server set, and the modified InodeList on each member of the set will be cleared.

4.4 Get_Status_Of_Filesystem_Server_Set(fsid, &status_struct)

Return the current state of the filesystem server set.

struct server_set_status {
    long configured_set;
    long current_set;
    long current_set_checkpoint_value;
    long joining_set;
    int  active_flag;
};

5. Description of Operations that Change the State of the Current Server Set

5.1 Activate_Server_Set(fsid)

Allow NFS client requests for this fsid to reach the NFS servers on the members of the current filesystem server set.

5.2 Pause Filesystem Server Set(fsid)

Queue NFS client requests for this fsid headed for the NFS servers on the members of the current filesystem server set. Note any queue space is finite, so pausing for too long can result in dropped messages. This operation waits until all pending NFS modification ops to this fsid have completed.

5.3 Continue Filesystem Server Set(fsid)

Queued NFS client requests for this fsid are allowed to proceed to the NFS servers on members of the current filesystem server set.

5.4 Deactivate_Server_Set(fsid)

Newly arriving NFS requests for this fsid are now dropped. This operation waits until all pending NFS modification ops to this fsid have completed.

6. Recovery Operations on a Filesystem Copy

There are two cases of Filesystem Copy:

6.1 Construction: refers to the Initialization of a “filesystem copy”, which will typically entail copying every block from the Source to the Target. Construction occurs when the Filesystem Synchronization Number does not match between two filesystem copies.

6.2 Restoration: refers to the recovery of a “filesystem copy”.

Restoration occurs when the Filesystem Synchronization Number matches between two filesystem copies.

Conceptually, the two cases are very similar to one another. There are three phases of each Copy:

-   I. First-pass: copy-method everything that has changed since the last Synchronization. For the Construction case, this really is EVERYthing; for the Restoration case, this is only the inodes in the IN-Mod list.
-   II. Copy-method the IN-Copy list changes, i.e. modifications which occurred while the first phase was being done. Repeat until the IN-Copy list is (mostly) empty; even if it is not empty, it is possible to proceed to synchronization at the cost of a longer synchronization time.
-   III. Synchronization by NCM: update of the Synchronization Number, clearing of the IN-Mod list. Note that by pausing ongoing operations at each NAS (and IXP if a new NAS is being brought into the peer group), it is possible to achieve synchronization on-line (i.e. during active NFS modify operations).

The copy-method refers to the actual method of copying used in either the Construction or Restoration cases. It is proposed here that the copy-method will actually hide the differences between the two cases.

6.3 NAS-FS-Copy

An NAS-FS copy inherently utilizes the concept of “inodes” to perform the Copy. This is built into both the IN-Mod and IN-Copy lists maintained on each NAS.

6.3.1 Construction of Complete Copy

Use basic volume block-level mirroring to make a “first pass” copy of the entire volume, from Source to Target NAS. This is an optimization to take advantage of sequential I/O performance; however, this will impact the copy-method. The copy-method will be an ‘image’ copy, i.e. it is a volume block-by-block copy; conceptually, the result of the Construction will be a mirror-volume copy. (Actually, the selection of volume block-level copying can be determined by the amount of “used” filesystem space; i.e. if the filesystem were mostly empty, it would be better to use an inode logical copy as in the Restoration case.)

For this to work correctly, since a physical-copy is being done, the completion of the Copy (i.e. utilizing the IN-Copy) must also be done at the physical-copy level; stated another way, the “inode” copy-method must be done at the physical-copy level to complete the Copy.

6.4 Copy-method

The inode copy-method must exactly preserve the inode: this is not just the inode itself, but also includes the block mappings. For example, copying the 128 bytes of the inode will only capture the Direct, 2nd-level, and 3rd-level indirect block pointers; it will not capture the data in the Direct blocks, nor the levels of indirection embedded in both the 2nd/3rd indirect blocks. In effect, the indirect blocks of an inode (if they exist) must be traversed and copied exactly; another way to state this is that the list of all block numbers allocated to an inode must be copied.

6.5 Special Inodes

Special inodes will be instantiated in both IN-Mod and IN-Copy which reflect changes to filesystem metadata: specifically block-allocation and inode-allocation bitmaps (or alternatively, for each UFS cylinder-group), and superblocks. This is because all physical changes (i.e. this is a physical-image copy) must be captured in this copy-method.

6.6 Locking

Generally, any missed or overlapping updates will be caught by repeating IN-Copy changes; any racing allocations and/or de-allocations will be reflected in both the inode (being extended or truncated) and the corresponding block-allocation bitmap(s). Note these special inodes are not used for Sparse Filesystem Copies.

However, while the block map is being traversed (i.e. 2nd/3rd indirect blocks), changes during the traversal must be prevented to avoid inconsistencies. Since the copy-method can be repeated, it would be best to utilize the concept of a soft-lock, which would allow an ongoing copy-method to be aborted by the owning/Source-NAS if there was a racing extension/truncation of the file.

6.7 Restoration of Complete Copy

This step assumes that two NAS' differ only in the IN-Mod list; to complete re-Synchronization, it requires that all changed inodes be propagated from the Source NAS to the Target NAS (since the last synchronization-point).

6.8 Copy-method

The Inode copy-method occurs at the logical level: specifically, the copying is performed by logical reads of the inode, and no information is needed about the actual block mappings (other than to maintain sparse-inodes). Recall the Construction case required a physical-block copy of the inode block-maps (i.e. block-map tree traversal), creating a physical-block mirror-copy of the inode.

6.9 Special Inodes

No special inodes are needed, because per-filesystem metadata is not propagated for a logical copy.

6.10 Locking

Similarly to the construction case, a soft-lock around an inode is all that is needed.

6.11 Data structures

There are two primary Lists: the IN-Mod and the IN-Copy list. The IN-Copy is logically nested within the IN-Mod.

6.11.1 Modified-Inodes-list (IN-Mod)

The IN-Mod is the list of all modified inodes since the last Filesystem Checkpoint:

-   Worst-case, if an empty filesystem was restored from backup, the list would encompass every allocated inode.
-   Best-case, an unmodified filesystem will have an empty list; or a filesystem with a small working-set of inodes being modified will have a (very) small list.

The IN-Mod is used as a recovery tool, which allows the owning NAS to be used as the ‘source’ for a NAS-FS-Copy. It allows the NCM to determine which inodes have been modified since the last Filesystem Checkpoint.

The IN-Mod is implemented in non-volatile storage, primarily for the case of chassis crashes (i.e. all NAS' crash), as one IN-Mod must exist to recover. Conceptually, the IN-Mod can be implemented as a Bitmap or as a List.

The IN-Mod tracks any modifications to any inode by a given NAS. This could track any change to the inode ‘object’ (i.e. both inode attributes and inode data), or differentiate between the inode attributes and the data contents.

The IN-Mod must be updated (for a given inode) before the inode is committed to non-volatile storage (i.e. disk, or NVRAM); otherwise, there is a window where the system could crash and the change not be reflected in the IN-Mod. In a BSD implementation, the call to add a modified inode to the IN-Mod could be done in VOP_UPDATE.

Finally, the Initialization case requires ‘special’ inodes to reflect non-inode disk changes, specifically filesystem metadata; e.g. cylinder-groups, superblocks. Since Initialization is proposing to use a block-level copy, all block-level changes need to be accounted for by the IN-Mod.

6.11.2 Copy-Inodes-list (IN-Copy)

The IN-Copy tracks any modifications to an inode by a given NAS once a Copy is in progress: it allows a Source-NAS to determine which inodes still need to be copied because they have changed during the Copy. In other words, it is an interim modified-list which exists during a Copy. Once the Copying begins, all changes made to the IN-Mod are mirrored in the IN-Copy; this effectively captures all changes “since the Copy is in-progress”.
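A compact sketch of that relationship follows, assuming a bitmap representation and hypothetical field names: the IN-Mod update happens before the inode reaches stable storage, and the same update is mirrored into the IN-Copy whenever a Copy is in progress.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-filesystem state for the two lists, kept as bitmaps
 * here for brevity (the text allows either a bitmap or a list). */
struct fs_copy_state {
    uint32_t *in_mod;        /* non-volatile: modified since last checkpoint */
    uint32_t *in_copy;       /* volatile: modified while a Copy is running   */
    bool      copy_in_progress;
};

static void set_bit(uint32_t *map, uint32_t ino)
{
    map[ino / 32] |= 1u << (ino % 32);
}

/* Called before the modified inode is committed to disk or NVRAM, e.g.
 * from VOP_UPDATE in a BSD-style implementation, as suggested above. */
void record_inode_modification(struct fs_copy_state *fs, uint32_t ino)
{
    set_bit(fs->in_mod, ino);            /* must reach non-volatile storage    */
    if (fs->copy_in_progress)
        set_bit(fs->in_copy, ino);       /* changes made "during the Copy"     */
}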

6.11.3 Copy Progress

The Source NAS needs to know which inodes to copy to the Target NAS. Conceptually, this is a snapshot ‘image’ of the IN-Mod before the IN-Copy is enabled, as this lists all the inodes which need to be copied at the beginning of the Copy (and, where the IN-Copy captures all changes rolling forward). In practice, the IN-Mod itself can be used, at the minor cost of repeating some of the Copy when the IN-Copy is processed.

Note the IN-Copy need not be implemented in NVRAM, since any NAS crashes (either Source or Target) can be restarted from the beginning. If an IN-Copy is instantiated, the calls to update IN-Copy can be hidden in the IN-Mod layer.

6.11.4 Copying Inodes

An on-disk inode is 128 bytes (i.e. this is effectively the inode's attributes); the inode's data is variable length, and can vary between 0 and 4 GB, in filesystem fragment-size increments. On-disk inodes tend to be allocated in physically contiguous disk blocks, hence an optimization is to copy a large number of inodes all at once. CrosStor note: all inodes are stored in a reserved-inode (file) itself.

6.11.5 Construction Case

In this case, locking is necessary to prevent racing changes to the inode (and/or data contents), as the physical image of the inode (and data) needs to be preserved.

Specifically, the block mapping (direct and indirect blocks) needs to be preserved exactly in the inode; so both the block-mapping and every corresponding block in the file have to be written to the same physical block together.

As an example, assume the race is where a given file is first being truncated, and then extended. Since each allocated block needs to be copied exactly (i.e. same physical block number on the volume), care has to be taken that the copy does not involve a block in transition. Otherwise, locking on block allocations would have to occur on the source-NAS. Instead, locking on an inode would seem the better alternative here. An optimization would be to allow a source-NAS to ‘break’ a Copy-Lock, with the realization that an inode being Copied should defer to a waiting modification.

6.11.6 Restoration Case

In this case, no locking is implied during an inode-copy, since any “racing” modifications will be captured by the IN-Copy. A simple optimization might be to abort an in-progress Copy if such a ‘race’ is detected; e.g., imagine a very large file Copy which is being modified.

Specifically, the inode is copied, but not the block-mapping; the file data (represented by the block-mapping) is logically copied to the target NAS.

EXAMPLES Set 1

1. Walkthroughs of Operations on a Current Filesystem Server Set

Create_Current_Server_Set(fsid, slots/cpus)

-   -   Assumptions

Assume that no NAS server is serving the filesystem; the current filesystem server set is empty.

-   Steps
    -   NAS A boots and tells the NCM it is up.
    -   The NCM determines the new server's role in serving the filesystem and that the filesystem is not being served by any NAS servers.
    -   The NCM asks server A for the checkpoint value for the filesystem and also its modified InodeList.
    -   The NCM ensures that this is the most up-to-date copy of the filesystem. (Reconciles static configuration info on the filesystem with which servers are actually running, looks in NVRAM if needed . . . )
    -   NCM activates the server set.
    -   The filesystem is now being served.

Add_Member_to_Current_Filesystem_Server_Set(fsid)

-   -   Assumptions

Assume a complete copy of the filesystem is already being served.

-   The current filesystem server set contains NAS B.
-   The current filesystem server set is active.
-   NAS A is down.
-   NAS A boots and tells the NCM it is up.
-   Steps
    -   The NCM determines the new server's role in serving the filesystem and determines that the current server set for this filesystem contains only NAS B.
    -   The NCM asks server A for the checkpoint value for the filesystem and also its modified InodeList.
    -   NCM initiates recovery and asks NAS A to do it.
    -   NAS A finishes recovery and tells the NCM.
    -   The NCM pauses the current filesystem server set.
    -   NCM asks NAS A to do recovery to catch anything that might have changed since the last recovery request. This should only include NFS requests received since the last recovery.
    -   NAS A completes the recovery.
    -   The NCM asks all members of the set to update their filesystem checkpoint value. They all respond.
    -   The NCM resumes the current filesystem server set.
    -   A new filesystem checkpoint has been reached.

Checkpointing an Active Filesystem Server Set

-   Assumptions
-   Steps
    -   NCM determines it is time to bring all the members of the current server set to a checkpoint.
    -   NCM asks the NCM agent on one member of the server set to forward a multicast filesystem sync message to all members of the current server set. This message contains a new checkpoint value for the filesystem.
    -   Upon receipt of this message the NAS server must finish processing any NFS requests received prior to the sync message that apply to the filesystem. New requests must be deferred.
    -   The NAS server then writes the new checkpoint value to stable storage, clears any modified InodeLists for the filesystem and updates the NFS modification sequence number.
    -   The NAS server then sends a message to the NCM indicating that it has reached a new filesystem checkpoint.
    -   The NCM waits for these messages from all NAS servers.
    -   The NCM then sends a multicast to the current server set telling them to start processing NFS requests.
    -   The NCM then updates its state to indicate a new filesystem checkpoint has been reached.

EXAMPLES Set 2

2. UML Static Structure Diagram

FIG. 37 is a representation of the NCM, IXP and NAS server classes. For each, the top box is the name, the second box contains attributes of an instance of this class, and the bottom box describes the methods each class must implement.

Attributes Description

Data local to an instance of the class that make it unique.

Methods Description

Those preceded with a + are public and usually invoked by receiving a message. The method is preceded by the name of the sender of the message surrounded by << >>. Calling out the sender in the description should help you to correlate the messaging scenarios described in this document to implemented methods in the classes. Those preceded by a − are private methods that may be invoked during processing of public methods. They help to organize and reuse functions performed by the class.

VIII. System Mediation Manager

The following discussion sets forth the functional specification and design for the Mediation Manager subsystem of the Pirus box.

Mediation refers to storage protocol mediation, i.e., mediating between two transport protocols (e.g., FC and IP) that carry a storage protocol (SCSI). The system disclosed herein will use the mediation configurations shown in FIGS. 38A, B, C. Thus, for example, in FIG. 38A, the Pirus box terminates a mediation session. In FIGS. 38B and C, Pirus Box1 originates a mediation session and Pirus Box2 terminates it. In FIG. 38C, Pirus Box1 runs backup software to copy its disks to the other Pirus box.

1. Components

In accordance with one embodiment of the invention, mediation is handled by a Mediation Manager and one or more Mediation Protocol Engines. Their interaction with each other and with other parts of the Pirus box is shown in FIG. 39.

2. Storage Hierarchy

In accordance with known storage practice, at the lowest level of storage there are physical disks, and each disk has one or more LUNs. In the system of the invention, as shown in FIG. 40, the Volume Manager configures the disks in a known manner (such as mirroring, RAID, or the like) and presents them to the SCSI server as volumes (e.g., Vol 1 through Vol 5). The SCSI server assigns each volume to a Virtual LUN (VL0 through VL2) in a Virtual Target (VT0 through VT1).

The following behaviors are observed:

-   1. Each Volume corresponds to only one Virtual LUN.
-   2. Each Virtual Target can have one or more Virtual LUNs.
-   3. Each Virtual Target is assigned an IP address.
-   4. A virtual target number is unique in a Pirus box.

3. Functional Specification

In one practice of the invention, the Mediation Manager will be responsible for configuration, monitoring, and management of the Mediation Protocol Engines; and only one instance of the Mediation Manager will run on each 755 on the SRC. Each Mediation Manager will communicate with the MIC and the Mediation Protocol Engines as shown in FIG. 4 above. The MIC provides the configurations and commands, and the Mediation Protocol Engines will actually implement the various mediation protocols, such as iSCSI, SEP, and the like. The Mediation Manager will not be involved in the actual mediation; hence, it will not be in the data path.

4. Functional Requirements

-   1. In one practice of the invention, the Mediation Manager always listens to receive configuration and command information from the MIC, and sends statistics back to the MIC.
-   2. The Mediation Manager accepts the following configuration information from the MIC, and configures the Mediation Protocol Engines appropriately:
    -   a. Add a virtual target
        -   i. Mediation Protocol
            -   1. TCP/UDP port number
            -   2. Max inactivity time
        -   ii. Virtual target number
        -   iii. IP address
        -   iv. Number of LUNs
        -   v. Max number of sessions
    -   b. Modify a virtual target
    -   c. Remove a virtual target

-   3. Once configured by the MIC, the Mediation Manager spawns only one Mediation Protocol Engine for each configured mediation protocol. A Mediation Protocol Engine will handle all the sessions for that protocol to any/all the accessible disks on its Fibre Channel port.

-   4. The Mediation Manager accepts the following commands from the MIC and sends a corresponding command to the appropriate Mediation Protocol Engine:
    -   a. Start/Stop a Mediation Protocol Engine
    -   b. Abort a session
    -   c. Get/Reset a stat for a mediation protocol and virtual target

-   5. The Mediation Manager will collect statistics from the Mediation Protocol Engines and report them to the MIC. The stats are:
    -   a. Number of currently established sessions per mediation protocol per virtual target; this stat is unaffected by a stat reset.
    -   b. A list of all the sessions for a mediation protocol and virtual target: virtual LUN, attached server, idle time; this stat is unaffected by a stat reset.
    -   c. Number of closed sessions due to “inactivity” per mediation protocol per virtual target.
    -   d. Number of denied sessions due to “max # of sessions reached” per mediation protocol per virtual target.

-   6. The Mediation Manager will communicate the rules passed down by the MIC to the appropriate Mediation Protocol Engine:
    -   a. Host Access Control per mediation protocol (in one practice of the invention, this will be executed on the LIC)
        -   i. Deny sessions from a list of hosts/networks
        -   ii. Accept sessions only from a list of hosts/networks
    -   b. Storage Access Control per virtual target
        -   i. Age out a virtual target, i.e., deny all new sessions to a virtual target. This can be used to take a virtual target offline once all current sessions die down.

-   7. The Mediation Manager (as ordered by the user through the MIC) will send the following commands to the Mediation Protocol Engines:
    -   a. Start (this may be equivalent to spawning a new engine)
    -   b. Stop
    -   c. Abort a session
    -   d. Get/Reset stats for a mediation protocol and virtual target.

-   8. The Mediation Manager will register to receive ping (ICMP Echo Request) packets destined for any of its virtual targets.

-   9. Once the Mediation Manager receives a ping (ICMP Echo Request) packet for a virtual target, it will send a request to the “Storage Health Service” for a status check on the specified virtual target. Once the reply comes back from the Storage Health Service, the Mediation Manager will send back an ICMP Echo Reply packet.

-   10. The Mediation Manager will register to send/receive messages through IPC with the Storage Health Service.

5. Design

In the embodiment shown, only one Mediation Manager task runs on each 755 on the SRC. It listens for configuration and command information from the MIC to manage the Mediation Protocol Engines. It also reports back statistics to the MIC. The Mediation Manager spawns the Mediation Protocol Engines as tasks when necessary. In addition, it also handles ping (ICMP Echo Request) packets destined to any of its virtual targets.

6. Data Structures

In this embodiment, the data structures for keeping track of virtual target devices and their corresponding sessions are set up as shown in FIGS. 9-6. In the embodiment shown in FIG. 41, the number of supported virtual target devices on a Pirus box is 1024, with each having 256 sessions; and the virtual target devices are different for termination and origination.

At startup, the Mediation Manager sets up an array of MED_TYPE_CFG_T, one for each mediation protocol type: iSCSI, SEP, SCSI over UDP, and FC over IP. It will then allocate an array of pointers for each virtual target device, DEV_ENTRY_T. Once the MIC configures a new virtual target device (for termination or origination), the Mediation Manager allocates and links in a MED_DEV_CFG_T structure. Finally, when a new session is established, a MED_SESS_ENTRY_T structure is allocated.

This structure will provide a reasonable compromise between memory consumption and the speed at which the structure could be searched for a device or session.

In this practice of the invention, a session id is a 32-bit entity defined as follows to allow for direct indexing into the above structure.

Mediation type is 4 bits, which allows for 16 mediation protocol types.

The next single bit indicates whether it is for termination or origination.

The next 11 bits represent the device number, basically an index to the device array.

The next 8 bits of session number are the index into the session array.

Finally, 8 bits of generation number are used to distinguish old sessions from current sessions.
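One possible packing of that 32-bit session id is sketched below; only the field widths (4 + 1 + 11 + 8 + 8 bits) come from the text, and the placement of the fields within the word is an assumption.

#include <stdint.h>

/* Assumed layout: type in the top 4 bits, then the termination/origination
 * bit, the device number, the session number, and the generation number. */
#define SESS_ID_MAKE(type, orig, dev, sess, gen)            \
    ( ((uint32_t)((type) & 0xF)   << 28) |                  \
      ((uint32_t)((orig) & 0x1)   << 27) |                  \
      ((uint32_t)((dev)  & 0x7FF) << 16) |                  \
      ((uint32_t)((sess) & 0xFF)  <<  8) |                  \
      ((uint32_t)((gen)  & 0xFF)) )

#define SESS_ID_TYPE(id)   (((id) >> 28) & 0xF)    /* mediation protocol type      */
#define SESS_ID_ORIG(id)   (((id) >> 27) & 0x1)    /* termination vs. origination  */
#define SESS_ID_DEV(id)    (((id) >> 16) & 0x7FF)  /* index into the device array  */
#define SESS_ID_SESS(id)   (((id) >>  8) & 0xFF)   /* index into the session array */
#define SESS_ID_GEN(id)    ( (id)        & 0xFF)   /* generation number            */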

7. Flow Chart

In this practice of the invention, there will be one semaphore that the Mediation Manager will wait upon. Two events will post the semaphore to awaken the Mediation Manager:

-   1. Arrival of a packet through IPCEP from the MIC
-   2. Arrival of a ping packet

As indicated in FIG. 42, the Mediation Manager processing includes the following steps:

-   Initializing all data structures for mediation 4201;
-   Creating two queues: one for ping packets and one for IPCEP messages 4202;
-   Registering to receive IPCEP messages from the MIC 4203;
-   Registering to receive ping packets from the TCP/IP stack 4204;
-   Waiting to receive ping packets from the TCP/IP stack;
-   Waiting to receive a ping or IPCEP message;
-   Checking whether the received item is an IPCEP message, and if so,
-   Retrieving the message from the queue, checking the message type, calling the med_engine API (or similar process), and then returning to the “wait to receive” step; or, if not,
-   Checking whether it is a ping packet, and if so, retrieving the message from the queue, processing the ping packet, contacting the storage health service, and returning to the “wait to receive” step; or
-   if not a ping packet, returning to the “wait to receive” step.
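The flow can be condensed into a loop of the following general shape; every type and helper in this sketch (the queues, the semaphore, the IPCEP registration, the med_engine dispatch, and the Storage Health check) is a hypothetical stand-in for the corresponding Pirus service, and only the control flow mirrors FIG. 42.

#include <stddef.h>
#include <stdbool.h>

/* Hypothetical primitives assumed to be provided elsewhere. */
typedef struct queue queue_t;
typedef struct semaphore sem_handle_t;

queue_t *queue_create(void);
void    *queue_get(queue_t *q);             /* NULL when the queue is empty */
void     sem_wait_forever(sem_handle_t *s);
void     ipcep_register_with_mic(queue_t *q);
void     icmp_register_virtual_targets(queue_t *q);
void     med_engine_handle_msg(void *msg);
bool     storage_health_check(void *ping);
void     icmp_echo_reply(void *ping);
void     med_mgr_init_data_structures(void);

extern sem_handle_t *med_mgr_sem;           /* posted on either arrival     */

void mediation_manager_task(void)
{
    med_mgr_init_data_structures();         /* step 4201                    */
    queue_t *ipcep_q = queue_create();      /* step 4202: two queues        */
    queue_t *ping_q  = queue_create();
    ipcep_register_with_mic(ipcep_q);       /* step 4203                    */
    icmp_register_virtual_targets(ping_q);  /* step 4204                    */

    for (;;) {
        void *msg;
        sem_wait_forever(med_mgr_sem);      /* wait for a ping or IPCEP msg */

        if ((msg = queue_get(ipcep_q)) != NULL) {
            med_engine_handle_msg(msg);     /* command/config from the MIC  */
        } else if ((msg = queue_get(ping_q)) != NULL) {
            if (storage_health_check(msg))  /* ask the Storage Health Svc   */
                icmp_echo_reply(msg);
        }                                   /* then return to waiting       */
    }
}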

IX. Mediation Caching

The following section describes techniques for utilizing data caching to improve access times and decrease latency in a mediation system according to the present invention. By installing a data cache on the Client Server (as illustrated in FIG. 43), the local clients can achieve faster access times for the data being served by the Data Server. The cache will provide access to data that has already (recently) been read from the Data Server. In the case where a client attempts to access a segment of data that has been previously read, either by the same client or any other attached client, the data can be delivered from the local cache. If the requested data is not in the local cache, the read operation must be transmitted to the Data Server, and the server will access the storage system. Once the data is transferred back to the Client Server, the data will be stored in the local cache, and be available for other clients to access.

In a similar fashion, the write performance of the clients can be improved by employing a Non-Volatile RAM (NVRAM) on the client server. Using the NVRAM, the system can reply to the local clients that the write operation is complete as soon as the data is committed to the NVRAM cache. This is possible since the data will be preserved in the NVRAM, and will eventually be written back to the Data Server for commitment to the storage device by the system. The performance can be further improved by altering the way in which the NVRAM data cache is manipulated before the data is sent to the Data Server. The write data from the NVRAM can be accumulated such that a large semi-contiguous write access can be performed to the data server rather than small piecewise accesses. This improves both the data transmit characteristics between the servers as well as the storage characteristics of the Data Server, since a large transfer involves less processor intervention than small transfers.

This system improves latency on data writes when there is space available in the write cache, because the client writer does not have to wait for the write data to be transmitted to the Data Server and be committed to the storage device before the acknowledgement is generated. The implied guarantee of commitment to the storage device is managed by the system through the utilization of NVRAM and a system to deliver the data to the Data Server after a system fault.

The system improves latency on data reads when the read data segment is available in the local read cache, because the client does not have to wait for the data transmission from the data server, or the storage access times, before the data is delivered. In the case where the data is not in the local cache, the system performance is no worse than a standard system.

The system requires that the data in the write cache be available to the client readers so that data integrity can be maintained. The order of operation for read access is:

-   1) check the local write cache for data segment match
-   2) (if not found in 1) check the local read cache for data segment match
-   3) (if not found in 2) issue the read command to the Data Server
-   4) Once the data is transmitted from the Data Server, save it in the local data cache.

The order of operation for write access is:

-   1) check the local read cache for a matching data segment and invalidate the matching read segments
-   2) check the local write cache for matching write segments and invalidate (or re-use)
-   3) generate a new write cache entry representing the write data segments.
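For the simple (single Client Server) system, the read path above reduces to the following sketch; the cache-lookup helpers are hypothetical, and segment matching is reduced to an offset/length pair for brevity.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache interfaces; only the ordering of the checks is
 * taken from the list above. */
bool write_cache_lookup(uint64_t off, size_t len, void *buf);
bool read_cache_lookup(uint64_t off, size_t len, void *buf);
void read_cache_insert(uint64_t off, size_t len, const void *buf);
void data_server_read(uint64_t off, size_t len, void *buf);

/* Read path on the Client Server. */
void client_read(uint64_t off, size_t len, void *buf)
{
    if (write_cache_lookup(off, len, buf))   /* 1) local write cache         */
        return;
    if (read_cache_lookup(off, len, buf))    /* 2) local read cache          */
        return;
    data_server_read(off, len, buf);         /* 3) go to the Data Server     */
    read_cache_insert(off, len, buf);        /* 4) populate the local cache  */
}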

FIG. 43 shows the simple system with one Client Server per Data Server. Note that the client server can have any number of clients, and a Client Server can target any number of Data Servers.

The caching mechanism becomes more complex in a system such as the one shown in FIG. 44. When a system contains more than one Client Server per Data Server, the cache coherency mechanism must become more complex. This is because one client server can modify data that is in the local cache of the other client server, and the data will not match between the Client Servers.

Cache coherency can be maintained in the more complex system by determining the state of the cache on the Data Server. Before any data can be served from the Client Server local data cache, a message must be sent to the data server to determine if the data in the local data cache must be updated from the Data Server. One method of determining this is by employing time-stamps to determine if the data in the Client Server local data cache is older than that on the Data Server. If the cache on the Client Server needs to be updated before the data is served to the client, a transmission of the data segment from the Data Server will occur. In this case, the access from the client will look like a standard read operation as if the data were not in the local cache. The local data cache will be updated by the transmission from the Data Server, and the time-stamps will be updated.

Similarly, in the data write case, the Data Server must be consulted to see if the write data segments are locked by another client. If the segments are being written by another Client Server during the time a new Client Server wants to write the same segments (or overlapping segments), the new write must wait for the segments to be free (write operation complete from the first Client Server). A lightweight messaging system can be utilized to check and maintain the cache coherency by determining the access state of the data segments on the Data Server.

The order of operation for read access in the complex system is asfollows:

-   -   1) check the local write cache for data segment match    -   2) (if not found in 1) check the local read cache for data        segment match    -   3) (if the segment is found in local cache) send a request to        the Data Server to determine the validity of the local read        cache    -   4) If the local read cache is not valid, or the segment is not        found in the local cache, issue a read operation to the Data        Server.    -   5) Once the data is transmitted from the Data Server save it in        the local data cache.

Note that the invalid-cache case can be optimized by having the Data Server return the read data in the same reply that reports the local cache data as invalid. This saves an additional request round-trip.
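
A sketch of the coherent read path follows, assuming a hypothetical coherency message (ds_segment_valid) that, when the cached copy is stale, returns the fresh segment in its reply so that the extra round-trip noted above is avoided. The names and structures are illustrative only.

    #include <stdbool.h>

    struct cache;                                   /* opaque, hypothetical */
    extern bool cache_lookup(struct cache *c, unsigned long seg, void *buf);
    extern void cache_insert(struct cache *c, unsigned long seg, const void *buf);
    extern int  data_server_read(unsigned long seg, void *buf);

    /* Hypothetical coherency message: asks the Data Server whether the cached
     * copy is still current; if not, the reply carries the fresh data so that
     * no second round-trip is needed. */
    extern bool ds_segment_valid(unsigned long seg, unsigned long local_stamp,
                                 void *fresh_buf);

    int client_read_coherent(struct cache *wr, struct cache *rd,
                             unsigned long seg, unsigned long stamp, void *buf)
    {
        if (cache_lookup(wr, seg, buf))             /* 1) local write cache        */
            return 0;

        if (cache_lookup(rd, seg, buf)) {           /* 2) local read cache         */
            if (ds_segment_valid(seg, stamp, buf))  /* 3) validity check           */
                return 0;                           /*    cached copy is current   */
            cache_insert(rd, seg, buf);             /*    reply carried fresh data */
            return 0;
        }

        if (data_server_read(seg, buf) != 0)        /* 4) normal read              */
            return -1;
        cache_insert(rd, seg, buf);                 /* 5) fill the read cache      */
        return 0;
    }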

The order of operation for write access in the complex system is

-   1) check the local read cache for a matching data segment and invalidate the matching read segments
-   2) check the local write cache for matching write segments and invalidate (or re-use) them
-   3) send a message to the Data Server to determine if the write segment is available for writing (if the segment is not available, wait for the segment to become available)
-   4) generate a new write cache entry representing the write data segments
-   5) send a message to the Data Server to unlock the data segments.

Note that in step 3, the message will generate a lock on the data segment if the segment is available; this saves an additional request round-trip.
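
The corresponding coherent write path can be sketched as follows, again with hypothetical primitives: ds_lock_segment stands in for the lightweight message that both checks availability and takes the lock (the optimization noted above), and ds_unlock_segment releases it.

    struct cache;                                   /* opaque, hypothetical */
    extern void cache_invalidate(struct cache *c, unsigned long seg);
    extern void cache_insert(struct cache *c, unsigned long seg, const void *buf);

    /* Hypothetical lightweight coherency messages; ds_lock_segment() blocks
     * until the segment is free and acquires the lock in the same exchange. */
    extern int  ds_lock_segment(unsigned long seg);
    extern void ds_unlock_segment(unsigned long seg);

    int client_write_coherent(struct cache *wr, struct cache *rd,
                              unsigned long seg, const void *buf)
    {
        cache_invalidate(rd, seg);          /* 1) drop matching read segment       */
        cache_invalidate(wr, seg);          /* 2) drop (or re-use) old write entry */

        if (ds_lock_segment(seg) != 0)      /* 3) wait for the segment, take lock  */
            return -1;

        cache_insert(wr, seg, buf);         /* 4) new write cache entry            */
        ds_unlock_segment(seg);             /* 5) release the Data Server lock     */
        return 0;
    }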

X. Server Health Monitoring

The following discussion describes the Pirus Server Health Manager, a system process that runs within the Pirus chassis and monitors the state of storage services that are available to external clients. Server Health manages the state of Pirus Storage services, and uses that data to regulate the flow of data into the Pirus chassis. Server Health will use this information to facilitate load balancing, fast-forwarding or discarding of traffic coming into the system.

The Pirus Server Health Manager (SHM) is responsible for monitoring the status or health of a target device within the Pirus chassis. Pirus target devices can include, for example, NAS and mediation/iSCSI services that run on processors connected to storage devices.

In one practice of the invention, the SHM runs on the Pirus system processor (referred to herein as the Network Engine Card or NEC) where NAS or iSCSI storage requests first enter the system. These requests are forwarded from this high-speed data path across a switched fabric to target devices. SHM will communicate with software components in the system and provide updated status to the data-forwarding path.

1. Operation with Network Attached Storage (NAS)

In accordance with the invention, SHM communicates with components on the NAS Storage Resource Card (SRC) to monitor the health of NFS services. NFS requests are originated from the NEC and inserted into the data stream along with customer traffic that enters from the high-speed data path. Statistics are gathered to keep track of latency, timeouts and any errors that may be returned from the server.

SHM also exchanges IPC messages with the NAS Coherency Manager (NCM) on the SRC to pass state information between the two processors. Message sequences exchanged between these two systems can originate from the NAS or from the NEC.

2. Operation with iSCSI/Mediation Devices

SHM will also communicate with a Mediation Device Manager (MDM) that runs on an SRC card and manages mediation devices such as iSCSI. SHM will send ICMP messages to each target and wait on responses. Statistics are also gathered for mediation devices to keep track of latency, timeouts and error codes. IPC messages will also be sent from the NEC to MDM whenever an ICMP request times out.
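
One way such a probe loop might be structured is sketched below. The probe and reply primitives are hypothetical stand-ins for the ICMP echo exchange (or, for NAS targets, an injected NFS request); the statistics shown are merely representative of the kind of data SHM feeds to the forwarding path.

    /* Per-target health statistics of the kind SHM accumulates. */
    struct target_health {
        unsigned long probes, timeouts, errors;
        double        last_rtt_ms, avg_rtt_ms;
        int           healthy;
    };

    /* Hypothetical probe primitives standing in for the ICMP echo exchange. */
    extern int send_probe(int target_id);
    extern int wait_reply(int target_id, double timeout_ms, double *rtt_ms);

    void probe_target(int target_id, struct target_health *h)
    {
        double rtt = 0.0;

        h->probes++;
        if (send_probe(target_id) != 0) {
            h->errors++;
            h->healthy = 0;                   /* tell the forwarding path to avoid it */
            return;
        }
        switch (wait_reply(target_id, 500.0, &rtt)) {   /* 500 ms: arbitrary timeout */
        case 0:                               /* reply received                       */
            h->last_rtt_ms = rtt;
            h->avg_rtt_ms  = 0.9 * h->avg_rtt_ms + 0.1 * rtt;  /* smoothed latency   */
            h->healthy     = 1;
            break;
        case 1:                               /* timed out: NEC notifies MDM via IPC  */
            h->timeouts++;
            h->healthy = 0;
            break;
        default:                              /* error code returned by the target    */
            h->errors++;
            h->healthy = 0;
            break;
        }
    }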

Interaction with Data Forwarding Services: Data arrives into the Pirus chassis from high-speed network interfaces like Ethernet. Low-level drivers and the Intelligent Filtering and Forwarding (IFF) component, described elsewhere in this document, receive this data. IFF works with code in the IXP1200 Micro-engine to forward traffic across the backplane to the NAS or iSCSI service.

3. Forwarding of NFS Traffic

Either a single server or multiple servers within the Pirus chassis can consume NFS traffic. It is contemplated that NFS traffic forwarded to a single server will always be sent to the same target CPU across the backplane as long as that CPU and server are alive and healthy.

A group of NFS servers can provide the same ‘virtual’ service where traffic can be forwarded to multiple servers that reside on multiple CPUs. In this configuration, NFS write and create operations are replicated to every member of the group, while read operations can be load balanced to a single member of the group. The forwarding decision is based on the configured policy along with the server health of each of the targets.

Load balancing decisions for read operations may be based on a virtual service (defined by a single virtual IP address) and could be as simple as round-robin, or, alternatively, use a configured weight to determine packet forwarding. The health of an individual target could drop one of these servers out of the list of candidates for forwarding or affect the weighting factor.

Load balancing may also be based on NFS file handles. This requires that server health, IFF and micro-engine code manage state on NFS file handles and use this state for load balancing within the virtual service. File handle load balancing will work with target server balancing to provide optimum use of services within the Pirus chassis.

4. NFS Read Load Balancing Algorithms

The following read load balancing algorithms can be employed:

-   Round robin to each server within a virtual service
-   Configured weight of each server within a virtual service
-   Fastest response time determines the weight of each server within a virtual service
-   New file handle round robin to a server within a virtual service; accesses to the same file handle are always directed to the same server
-   New file handle configured weight to a server within a virtual service; accesses to the same file handle are always directed to the same server
-   Heavily accessed file list split across multiple servers

Each of the algorithms above will be affected by server health status along with previous traffic loads that have been forwarded. Servers may drop out of the server set if there is congestion or failure on the processor or associated disk subsystem.
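
By way of illustration, the following sketch combines two of the policies listed above: a configured (or response-time derived) weight per server, and file-handle affinity so that accesses to the same file handle always reach the same member. All names and structures are hypothetical; the actual balancing state is managed by the server health, IFF and micro-engine code.

    /* Candidate NFS server within a virtual service. */
    struct nfs_target {
        int      cpu;         /* destination CPU behind the backplane     */
        unsigned weight;      /* configured or response-time derived      */
        int      healthy;     /* maintained by the Server Health Manager  */
    };

    /* Simple hash so that accesses to the same NFS file handle map to a
     * stable value within the virtual service. */
    static unsigned fh_hash(const unsigned char *fh, unsigned len)
    {
        unsigned h = 5381;
        while (len--)
            h = h * 33 + *fh++;
        return h;
    }

    /* Weighted pick over healthy members; the file-handle hash seeds the
     * selection so an existing handle keeps its affinity.                 */
    int pick_read_target(const struct nfs_target *t, int n,
                         const unsigned char *fh, unsigned fh_len)
    {
        unsigned total = 0, mark, i;

        for (i = 0; i < (unsigned)n; i++)
            if (t[i].healthy)
                total += t[i].weight;
        if (total == 0)
            return -1;                       /* no healthy member available */

        mark = fh_hash(fh, fh_len) % total;  /* deterministic per file handle */
        for (i = 0; i < (unsigned)n; i++) {
            if (!t[i].healthy)
                continue;
            if (mark < t[i].weight)
                return i;
            mark -= t[i].weight;
        }
        return -1;                           /* not reached */
    }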

XI. Fast-Path: Description of Illustrated Embodiments

The following description refers to examples of Fast-Path implemented in the Pirus Box and depicted in the attached FIGS. 45 and 46. As noted above, however, the Fast-Path methods are not limited to the Pirus Box, and can be implemented in substantially any TCP/UDP processing system, with different combinations of hardware and software, the selection of which is a matter of design choice. The salient aspect is that Fast-Path code is accelerated using distributed, synchronized, fast-path and slow-path processing, enabling TCP (and UDP) sessions to run faster and with higher reliability. The described methods simultaneously maintain TCP state information in both the fast-path and the slow-path, with control messages exchanged between fast-path and slow-path processing engines to maintain state synchronization and hand off control from one processing engine to another. These control messages can be optimized to require minimal processing in the slow-path engines while enabling efficient implementation in the fast-path hardware. In particular, the illustrated embodiments provide acceleration in accordance with the following principles:

-   1. Packet processing in a conventional TCP/IP stack is complex and time consuming. However, most packets do not represent an exceptional case and can be handled with much simpler and faster processing. The illustrated embodiments (1) establish a parallel, fast-path TCP/IP stack that handles the majority of packets with minimal processing, (2) pass exceptions to the conventional (slow-path) stack for further processing and (3) maintain synchronization between fast and slow paths.
-   2. As a matter of design choice, the illustrated embodiments employ IXP micro-engines to execute header verification, flow classification, and TCP/IP check-summing. The micro-engines can also be used for other types of TCP/IP processing. Processing is further accelerated by this use of multiple, high-speed processors for routine operations.
-   3. The described system also enables full control over the Mediation applications described in other sections of this document. Limits can be placed on the behavior of such applications, further simplifying TCP/IP processing.

1. Fast-Path Architecture

Referring to FIG. 45, the illustrated Fast-Path implementations in the Pirus Box include the following three units, the functions of which are described below:

-   1. The Fast-Path module of the SRC card, which integrates the Fast-Path TCP/IP stack. This module creates and destroys Fast-Path sessions based on the TCP socket state, and executes TCP/UDP/IP processing for Fast-Path packets.
-   2. Micro-engine code running on the IXPs. This element performs IP header verification, flow classification (by doing a four-tuple lookup in a flow forwarding table) and TCP/UDP check-summing.
-   3. IFF control code running on the IXP ARM. This module creates/destroys forwarding entries in the flow forwarding table based on IPC messages from the SRC.

2. Fast-Path Functions

2.1 LRC Processing

Referring now to FIGS. 45 and 46, it will be seen that the illustrated embodiments of Fast-Path utilize both LRC and SRC processing. When VSEs (Virtual Storage Endpoints) are created, IP addresses are assigned to each, and these IP addresses are added to the IFF forwarding databases on all IXPs. For Mediation VSEs, forwarding table entries will be labeled as Mediation in the corresponding destination IPC service number. When the IXP Receive micro-engine receives a packet from its Ethernet interface, it executes a lookup in the IFF forwarding database. If a corresponding entry is found for that packet, and the associated destination service is Mediation, the packet is passed to the IXP Mediation micro-engine for Fast-Path processing. The IXP Mediation micro-engine first verifies the IP header for correctness (length, protocol, IP checksum, no IP options and the like), verifies the TCP/UDP checksum, and then executes a flow lookup. If a corresponding entry is found, the flow ID is inserted into the packet (overwriting the MAC address) and the packet is forwarded to the Fast-Path service on the destination SRC. If a corresponding entry is not found, the packet is forwarded to the IFF service on the destination SRC.
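
The receive-side classification just described can be sketched in C as follows. The table interfaces and service identifiers are hypothetical placeholders, not the actual IXP micro-engine code (which is micro-code rather than C), and verification failures are shown simply falling through to the IFF service; the detailed dispositions appear in section 2.5 below.

    #include <stdint.h>

    /* Four-tuple key used for the flow forwarding table lookup. */
    struct flow_key {
        uint32_t ip_src, ip_dst;
        uint16_t port_src, port_dst;
    };

    struct flow_entry {
        uint8_t  dst_slot, dst_cpu;   /* destination SRC card / processor */
        uint32_t flow_id;             /* written over the MAC address     */
    };

    /* Hypothetical table interfaces and forwarding primitives. */
    extern int  iff_lookup(uint32_t dst_ip, int *is_mediation, int *dst_slot);
    extern int  flow_lookup(const struct flow_key *key, struct flow_entry *out);
    extern int  headers_and_checksums_ok(const void *pkt, unsigned len);
    extern void forward_to_service(const void *pkt, unsigned len,
                                   int slot, int cpu, int service, uint32_t flow_id);

    enum { SVC_FAST_PATH = 1, SVC_IFF_SLOW_PATH = 2 };

    void ixp_classify(const void *pkt, unsigned len, const struct flow_key *key)
    {
        int is_mediation = 0, slot = 0;
        struct flow_entry fe;

        if (iff_lookup(key->ip_dst, &is_mediation, &slot) != 0)
            return;                                   /* no VSE owns this address */

        if (is_mediation &&
            headers_and_checksums_ok(pkt, len) &&
            flow_lookup(key, &fe) == 0) {
            /* Known Fast-Path flow: stamp the flow ID and send to the
             * Fast-Path service on the destination SRC. */
            forward_to_service(pkt, len, fe.dst_slot, fe.dst_cpu,
                               SVC_FAST_PATH, fe.flow_id);
        } else {
            /* Unknown flow or exceptional packet: hand to IFF on the SRC. */
            forward_to_service(pkt, len, slot, 0, SVC_IFF_SLOW_PATH, 0);
        }
    }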

2.2 SRC Processing

Referring again to FIGS. 45 and 46, when the Fast-Path service on the SRC receives packets from the IPC layer, the SRC extracts the Session ID from the packet and uses it to look up the socket and TCP control blocks. It then determines whether the packet can be processed by the Fast-Path: i.e., the packet is in sequence, no retransmission, no data queued in the socket's Send buffer, no unusual flags, no options other than timestamp, and the timestamp is correct. If any condition is not met, the packet is injected into the slow-path TCP input routine for full processing. Otherwise, TCP counters are updated, ACK-ed data (if any) is released, an ACK packet is generated (if necessary), and the packet is handed directly to the application.
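
The eligibility test described above can be expressed roughly as follows. The structures are simplified stand-ins for the real socket and TCP control blocks; the exact set of conditions is as listed in the preceding paragraph.

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal view of the state consulted for the eligibility decision. */
    struct fp_tcp_state {
        uint32_t rcv_nxt;            /* next expected sequence number         */
        unsigned send_queue_len;     /* unsent/unacked data in the Send buffer */
        bool     ts_enabled;         /* timestamp option negotiated            */
    };

    struct fp_segment {
        uint32_t seq;
        uint8_t  flags;              /* TCP flags                              */
        bool     has_other_options;  /* anything besides timestamp             */
        bool     ts_ok;              /* timestamp present and in window        */
    };

    #define TCP_FLAG_ACK 0x10

    /* Returns true when the packet can stay on the Fast-Path; anything
     * unusual falls back to the conventional (slow-path) TCP input routine. */
    bool fast_path_eligible(const struct fp_tcp_state *tcb,
                            const struct fp_segment *seg)
    {
        if (seg->seq != tcb->rcv_nxt)          return false;  /* out of sequence   */
        if (tcb->send_queue_len != 0)          return false;  /* data queued       */
        if (seg->flags != TCP_FLAG_ACK)        return false;  /* unusual flags     */
        if (seg->has_other_options)            return false;  /* options != tstamp */
        if (tcb->ts_enabled && !seg->ts_ok)    return false;  /* bad timestamp     */
        return true;
    }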

2.3 Session Creation/Termination

In the illustrated embodiments, a Fast-Path session is established immediately after establishment of a standard TCP session (inside Accept and Connect Calls), and destroyed just before the socket is closed (inside Close Call). A socket's Send Call will be modified to attempt a Fast-Path Send from the user task's context all the way to the IPC. If Fast-Path fails, the job will fall back to the regular (slow-path) code path of the TCP Send Call, by sending a message to the TCP task. Conversely, the Fast-Path Receive routines, which can be executed from an interrupt or as a separate task, can forward received packets to the user task's message queue, just as conventional TCP Receive processing does. As a result, from the perspective of the user application, packets received by the Fast-Path system are indistinguishable from packets received via the slow-path.

Referring again to FIGS. 45 and 46, at an initial time (i.e., prior to Fast-Path session creation), there will be no entries in the flow forwarding table, and all packets will pass through the IFF/IP/TCP path on the SRC as described in the other sections of this document. When a TCP (or UDP) connection is established, the TCP socket's code will call Fast-Path code to create a Fast-Path session. When the Fast-Path session is created, all IXPs will be instructed to create a flow forwarding table entry for the session. This ensures that if the route changes and a different IXP begins to receive connection data, appropriate routing information will be available to the “new” IXP. (In IP architectures it is possible to have an asymmetric path, in which outgoing packets are sent to an IXP different from the one receiving the incoming packets. As a result, it would be insufficient to maintain a forwarding table only on the IXP that sends packets out.) Each time a Mediation forwarding table entry is added to the associated IXP's forwarding table, it will broadcast to all SRCs (or, in an alternative embodiment, uni-cast to the involved SRC) a request to re-post any existing Fast-Path sessions for the corresponding address. This step ensures that when a new IXP is added (or crashes and is then re-booted), the pre-existing Fast-Path state is restored. Subsequently, when the TCP (or UDP) connection is terminated, the TCP sockets code will call Fast-Path code to delete the previously-created Fast-Path session. All IXPs then will be instructed to destroy the corresponding flow forwarding table entry.

In the case that an SRC processor crashes or is removed from service, the MIC module will detect the crash or removal and issue a command to remove the associated Mediation IP address. Similarly, if the SRC processor is restarted, it will issue a command to once again add the corresponding Mediation IP address. When the IFF module on the IXP removes the forwarding entry for the corresponding Mediation IP address, it will also remove all corresponding Fast-Path session forwarding entries.

2.4 Session Control Blocks

The described Fast-Path system maintains a table of Fast-Path Session Control blocks, each containing at least the following information:

-   1. Socket SID and SUID, for Fast TCP and Socket Control blocks in Receive operations.
-   2. TCP/IP/Ethernet or UDP/IP/Ethernet header templates for Send operations.
-   3. Cached IP next-hop information, including outgoing source and destination MAC addresses, and the associated IXP's slot, processor and port numbers.

An index of the Session Control block serves as a Session ID, enabling rapid session lookups. When a Fast-Path Session is created, the Session ID is stored in the socket structure to enable quick session lookup during Sends.
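
An illustrative layout for such a control block and its table is sketched below. Field names and sizes are assumptions made for the sketch; only the categories of information correspond to the list above.

    #include <stdint.h>

    /* One Fast-Path Session Control block, indexed by Session ID. */
    struct fp_session {
        uint32_t sock_sid;           /* socket SID for Receive lookups          */
        uint32_t sock_suid;          /* socket SUID                             */
        uint8_t  hdr_template[54];   /* TCP/IP/Ethernet (or UDP) Send template  */
        uint8_t  next_hop_mac[6];    /* cached IP next-hop MAC address          */
        uint8_t  ixp_slot, ixp_cpu, ixp_port;  /* egress IXP location           */
        uint8_t  in_use;
    };

    #define FP_MAX_SESSIONS 4096     /* table size chosen arbitrarily here */
    static struct fp_session fp_table[FP_MAX_SESSIONS];

    /* The table index doubles as the Session ID, so lookups are O(1). */
    static inline struct fp_session *fp_session_by_id(uint32_t session_id)
    {
        if (session_id >= FP_MAX_SESSIONS || !fp_table[session_id].in_use)
            return 0;
        return &fp_table[session_id];
    }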

2.5 IXP Services

Referring again to FIGS. 45 and 46, when a new Fast-Path session is established, the IXPs in the Pirus Box are set to forward the TCP or UDP flow to a well-known Fast-Path service on the destination SRC processor. The associated IXP will insert an associated Fast-Path flow ID into the first word of the packet's Ethernet header (thereby overriding the destination MAC address) to permit easy flow identification by the Fast-Path processing elements. The IXP will execute a lookup of a four-tuple value (consisting of ip_src, ip_dst, port_src, port_dst) in the forwarding table to determine the destination (card, processor, flow ID). In addition, the IXP will execute the following steps for packets that match the four-tuple lookup:

-   1. Check the IP header for correctness. Drop the packet if this fails.
-   2. Execute the IP checksum. Drop the packet if this fails.
-   3. Confirm that there is no fragmentation and no IP options. (As a matter of design choice, certain TCP options are permitted, for timestamp and RDMA.) If this fails, forward the packet to the SRC “slow path” (IFF on the SRC).
-   4. Execute the TCP or UDP checksum. If this fails, send the packet to a special error service on the SRC.

The IXP can also execute further TCP processing, including, but not limited to, the following steps (both sets of checks are summarized in the sketch after this list):

-   1. Confirm that the header length is correct.
-   2. Confirm that the TCP flags are ACK and nothing else.
-   3. Confirm that the only option is the TCP timestamp.
-   4. Remember the last window value and confirm that it has not changed.
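
The four mandatory checks and their dispositions (drop, slow path, or error service) can be summarized as follows; the verification primitives are hypothetical, and on the real hardware these checks run in the IXP micro-engines rather than in C. The further TCP checks listed above can be layered on in the same fashion.

    enum pkt_disposition {
        PKT_FAST_PATH,     /* all checks passed                          */
        PKT_DROP,          /* bad IP header or IP checksum               */
        PKT_SLOW_PATH,     /* fragment / unsupported option: IFF on SRC  */
        PKT_CSUM_ERROR     /* TCP/UDP checksum failed: SRC error service */
    };

    /* Hypothetical verification primitives. */
    extern int ip_header_ok(const void *pkt);
    extern int ip_checksum_ok(const void *pkt);
    extern int has_frag_or_unsupported_options(const void *pkt);
    extern int l4_checksum_ok(const void *pkt);

    enum pkt_disposition verify_packet(const void *pkt)
    {
        if (!ip_header_ok(pkt))                    return PKT_DROP;       /* step 1 */
        if (!ip_checksum_ok(pkt))                  return PKT_DROP;       /* step 2 */
        if (has_frag_or_unsupported_options(pkt))  return PKT_SLOW_PATH;  /* step 3 */
        if (!l4_checksum_ok(pkt))                  return PKT_CSUM_ERROR; /* step 4 */
        return PKT_FAST_PATH;
    }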

The IXP can also have two special well-known services: TCP_ADD_CHECKSUM and UDP_ADD_CHECKSUM. Packets sent to these services will have TCP and IP, or UDP and IP, checksums added to them. Thus, the illustrated Fast-Path embodiment can utilize a number of well-known services, including two on the IXP:

-   IPC_SVC_IXP_TCP_CSUM: adds the TCP checksum to outbound packets
-   IPC_SVC_IXP_UDP_CSUM: adds the UDP checksum to outbound packets

and three on the SRC:

-   IPC_SVC_SRC_FP: Fast-Path input
-   IPC_SVC_SRC_SP: “slow path” input
-   IPC_SVC_SRC_FP_ERR: error service that increments error counters

3. Further Fast Path Aspects

Referring again to FIGS. 45 and 46, all Fast-Path IPC services (i.e., each service corresponding to a TCP or UDP connection) will have the same IPC callback routine. The flow ID can be readily extracted from the associated Ethernet header information, and can be easily translated into a socket descriptor/socket queue ID by executing a lookup in a Fast-Path session table. Subsequently, both TCB and socket structure pointers can also be quickly obtained by a lookup.

Fast-Path processing will be somewhat different for TCP and UDP. In the case of UDP, Fast-Path processing of each packet can be simplified substantially to the updating of certain statistics. In the case of TCP, however, a given packet may or may not be eligible for Fast-Path processing, depending on the congestion/flow-control state of the connection. Thus, a Fast-Path session table entry will have a function pointer for either the TCP or UDP Fast-Path protocol handler routine, depending on the socket type. In addition, the TCP handler will determine whether a packet is Fast-Path eligible by examining the associated Fast-Path connection entry, TCP header, TCP control block, and socket structure. If a packet is Fast-Path eligible, the TCP handler will maintain the TCP connection, and transmit control information to the Mediation task's message queue. If the TCP stack's Send process needs to be restarted, the TCP handler will send a message to the TCP stack's task to restart the buffered Send. Conversely, if a packet is not eligible for Fast-Path, the TCP handler will send it to the slow-path IP task.

In the illustrated embodiments, the Socket Send Call checks to determine whether the socket is Fast-Path enabled and, if it is, calls the Fast-Path Send routine. The Fast-Path Send routine will obtain socket and TCB pointers and will attempt to execute a TCP/IP shortcut and send the packet directly to the IPC. In order to leave a copy of the data in the socket, in case TCP needs to retransmit, the Fast-Path module will duplicate the BJ and IBD, increment the REF count on the buffer, and add the IBD to the socket buffer. The illustrated embodiments of Fast-Path do not calculate TCP and IP check-sums, but maintain two well-known service numbers, TCP_CHECKSUM_ADD and UDP_CHECKSUM_ADD; the IXP will add checksums on the packets received on these services. The destination IXP will be determined by referencing the source IXP of the last received packet. If the Fast-Path system is unable to transmit the packet directly to the IPC, it will return an error code to the Socket Send Routine, which will then simply continue its normal code path and send the packet to the slow-path TCP task's message queue for further processing.
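
A rough sketch of that Send shortcut follows. The buffer and socket primitives are hypothetical stand-ins (buf_dup_descriptor approximates duplicating the buffer descriptor and incrementing the REF count), and the checksum service and fallback behavior follow the description above. A caller would invoke fast_path_send first and, on a non-zero return, continue down the conventional Send path.

    struct socket;                       /* conventional stack structures (opaque) */
    struct fp_session;
    struct buf;                          /* packet buffer with its descriptor      */

    /* Hypothetical primitives assumed for the sketch. */
    extern struct fp_session *fp_session_of(struct socket *so);
    extern struct buf *buf_dup_descriptor(struct buf *b);  /* clone descriptor, bump REF */
    extern void buf_release(struct buf *b);
    extern void socket_append_unacked(struct socket *so, struct buf *b);
    extern int  fp_dest_slot(const struct fp_session *s);  /* from last received packet */
    extern int  fp_dest_cpu(const struct fp_session *s);
    extern int  ipc_send(int slot, int cpu, int service, struct buf *b);

    #define IPC_SVC_IXP_TCP_CSUM 1       /* IXP service that adds TCP and IP checksums */

    /* Called from the Socket Send path when the socket is Fast-Path enabled.
     * Returns 0 on success; on failure the caller falls back to the normal
     * (slow-path) TCP Send by messaging the TCP task.                        */
    int fast_path_send(struct socket *so, struct buf *b)
    {
        struct fp_session *s = fp_session_of(so);
        struct buf *copy;

        if (s == 0)
            return -1;                         /* not Fast-Path enabled: use slow path   */

        copy = buf_dup_descriptor(b);          /* keep data for possible retransmission  */

        /* Shortcut: hand the packet straight to the IPC; the IXP checksum
         * service adds the TCP and IP checksums on the way out.             */
        if (ipc_send(fp_dest_slot(s), fp_dest_cpu(s), IPC_SVC_IXP_TCP_CSUM, b) != 0) {
            buf_release(copy);
            return -1;                         /* caller reverts to the slow path        */
        }

        socket_append_unacked(so, copy);       /* retained copy is freed when ACKed      */
        return 0;
    }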

To provide additional streamlining and acceleration of TCP/UDP packet processing, a number of optional simplifications can be made. For example, the described Fast-Path does not itself handle TCP connection establishment and teardown. These tasks are handled by the conventional TCP stack on the SRC. Similarly, the described Fast-Path does not itself handle IP options and IP fragmentation; these conditions are handled by the conventional TCP stacks on both the LRC and the SRC. In the illustrated embodiments, Fast-Path handles the TCP timestamp option, while the conventional TCP stack on the SRC handles all other options. Similarly, the described Fast-Path system does not handle TCP retransmission and reassembly; these aspects are handled by the conventional TCP stack on the SRC. Certain security protocols, such as IPSec, change the IP protocol field and insert their own headers between the IP and TCP headers. The illustrated Fast-Path embodiments can be modified to handle this circumstance.

Fast-Path can be enabled by each socket's application on a per-socket basis. The system can be set to be disabled by default, and can be enabled by doing a socket ioctl after a socket is created, but before a connection is established. Apart from this, the described Fast-Path is transparent to the socket application, from the viewpoint of the socket interface.

The performance gains provided by Fast-Path are in part a function of the number of TCP retransmissions in the network. In networks having a large number of packet drops, most of the packets will go through the conventional TCP stack instead of the Fast-Path system. However, in a “good” LAN with limited packet drops, more than 90% of packets will go through Fast-Path, thus providing significant performance improvements.

For example, the invention can be implemented in the Pirus interconnection system described below and in U.S. provisional patent application No. 60/245,295 (referred to as the “Pirus Box”). The Pirus Box routes, switches and bridges multiple protocols across Fibre Channel, Gigabit Ethernet and SCSI protocols and platforms, thereby enabling interoperability of servers, NAS (network attached storage) devices, IP and Fibre Channel switches on SANs (storage area networks), WANs (wide area networks) or LANs (local area networks). Within the Pirus Box, multiple front-end controllers (IXPs) connect to a high-speed switching fabric and point-to-point serial interconnect. Back-end controllers connect to switched Ethernet or other networks, managing the flow of data from physical storage devices.

In one implementation of the invention within the Pirus Box, the Fast-Path includes Fast-Path code running on 750-series microprocessors, with hardware acceleration in IXP micro-engines. Alternatively, in a configuration having a close coupling between the IXP modules and the processors terminating TCP sessions, the Fast-Path code is executed together with the hardware acceleration in the IXP micro-engines. In each case, the described Fast-Path code can be highly optimized and placed in gates or micro-engines. Such code will execute much faster than a conventional TCP/IP stack, even when running on the same processor as a conventional stack.

The Fast-Path methods described herein are not limited to the Pirus Box, but can be implemented in substantially any TCP/UDP processing system.

GLOSSARY OF TERMS

-   Backplane—the Pirus box chassis is referred to herein a backplane;    however, it will be recognized that the chassis could alternatively    be a midplane design.-   CLI—Command Line Interface-   FC—Fibre Channel-   FSC—Fibre Channel Switching Card-   IFF—Layer 2, 3, 4 and 5 Intelligent Filtering and Forwarding switch-   JBOD—Just a Bunch of Disks-   LIC—LAN Interface Card-   MAC—Media Access Control—usually refers to an Ethernet interface    chip-   MIC—Management Interface Card-   MTU—Maximum Transfer Unit—largest payload that can be sent on a    medium.-   NEC—Network Engine Card-   NP—Network Processor-   SCSI—Small Computer Systems Interface-   SRC—Resource Module Card-   uP—Microprocessor-   ARP—Address Resolution Protocol-   CLI—Command Line Interface-   CONSOLE—System Console-   CPCM—Card/Processor Configuration Manager-   CSA—Configuration and Statistics Agent-   CSM—Configuration and Statistics Manager-   DC—Disk Cache-   Eth Drver—Ethernet Driver-   FC Nx—Fibre Channel Nx Port-   FFS—Flash File System-   FS—File System-   HTTP—Hyper Text Transfer Protocol-   HTTPS—Hyper Test Transfer Protocol Secured-   IP—Internet Protocol-   IPC—Inter Process Communication-   L2—Layer 2-   LHC—Local Hardware Control-   LOGI—Logging Interface-   MLAN—Management LAN-   MNT—Mount-   NFS—Network File Server-   RCB—Rapid Control Backplane-   RPC—Remote Procedure Call-   RSS—Remote Shell Service-   S2—System Services-   SAM—System Abstraction Model-   SB—Service Broker-   SCSI—Small Computer System Interface-   SFI—Switch Fabric Interface-   SGLUE—SNMP Glue-   SNMP—Simple Network Management Protocol-   SSC—Server State Client-   SSH—Secured Shell-   SSM—Server State Manager-   TCP—Transmission Control Protocol-   UDP—User Datagram Protocol-   VM—Volume Manager-   WEBH—WEB Handlers    Configured Filesystem Server Set: The set of NAS servers that have    been configured by the user to serve copies of the filesystem. Also    referred to as a NAS peer group.    Current Filesystem Server Set: The subset of the configured    filesystem server set that is made up of members that have    synchronized copies of the filesystem.    Joining Filesystem Server Set: Members not part of the Current    Filesystem Set that are in the process of joining that set.    Complete copy of a Filesystem: A copy of a filesystem containing    file data for all file inodes of a filesystem.    Construction of a Filesystem Copy: Building a sparse or complete    copy of a filesystem by copying every element of the source    filesystem.    Filesystem Checkpoint: NCM has insured that all members of the    current filesystem server set have the same copy of the filesystem.    A new filesystem checkpoint value was written to all copies and    placed on stable storage. The filesystem modification sequence    number on all members of the current filesystem server set is the    same. The IN-MOD has been cleared on all members of the current    filesystem server set.    Filesystem Checkpoint Value: Filesystems and NVRAM are marked with a    filesystem checkpoint value to indicate when running copies of the    filesystem were last checkpointed. This is used to identify stale    (non-identical, non-synchronized) filesystems.    Filesystem Modification Sequence Number: The number of NFS    modification requests performed by a NAS server since the last    filesystem checkpoint. Each NAS server is responsible for    maintaining its own stable storage copy that is accessible to the    NCM after a failure. 
The filesystem checkpoint value combined with    this number indicate which NAS server has the most recent copy of    the filesystem.    Inode List Allocated (IN-Alloc): The list of inodes in a filesystem    that have been allocated.    Inode List Content (IN-Con): The list of inodes in a filesystem that    have content present on a server; this must be a subset of IN-Alloc.    This will include every non-file (i.e. directory) inode. If this is    a Complete Copy of a Filesystem, then IN-Con is identical to    IN-Alloc.    Inode List Copy (IN-Copy): Which inodes of a filesystem have been    modified since we began copying the filesystem (during    Construction/Restoration); in the disclosed embodiments, this must    be a subset of IN-Con.    Inode List Modified (IN-Mod): Which inodes have been modified since    the last filesystem checkpoint. 2 filesystems with the same    filesystem checkpoint value should only differ by the changes    represented by their modified InodeList. A Filesystem Checkpoint    between two filesystems means that each is a logical image of one    another, and the IN-Mod can be cleared.    NCM—NAS Coherency Manager: The Pirus chassis process that is    responsible for synchronizing peer NAS servers.    Peer NAS Server: Any CPU that is a member of a virtual storage    target group (VST).    Recovery of a Filesystem Copy: Bringing an out of date filesystem    copy in sync with a later copy. This can be accomplished by    construction or restoration.    Restoration of a Filesystem Copy: Bringing a previously served    filesystem from its current state to the state of an up to date copy    by a means other than an element by element copy of the original.    Sparse copy of a filesystem: A copy of a filesystem containing file    data for less than all file inodes of a filesystem.    VST—Virtual Storage Target: As used herein, this term refers to a    group of NAS server CPUs within a Pirus chassis that creates the    illusion of a single NAS server to an external client.    ARM, StrongARM processors: general-purpose processors with embedded    networking protocols and/or applications compliant with those of ARM    Holdings, PLC (formerly Advanced RISC Machines) of Cambridge, U.K.    BSD: sometimes referred to as Berkeley UNIX, an open source    operating system developed in the 1970s at U.C. Berkeley. BSD is    found in nearly every variant of UNIX, and is widely used for    Internet services and firewalls, timesharing, and multiprocessing    systems.    IFF—Intelligent Forwarding and Filtering (described elsewhere in    this document in the context of the Pirus Box architecture).    IOCTL: A system-dependent device control system call, the ioctl    function typically performs a variety of device-specific control    functions on device special files.    IPC: Inter-Process Communications. On the Internet, IPC is    implemented using TCP transport-layer protocol.    IPSec: IP security protocol, a standard used for interoperable    network encryption.    IXP: Internet Exchange Processors, such as Intel's IXP 1200 Network    Processors, can be used at various points in a network or switching    system to provide routing and other switching functions. Intel's IXP    1200, for example, is an integrated network processor based on the    StrongARM architecture and six packet-processing micro-engines. It    supports software and hardware compliant with the Intel Internet    Exchange Architecture (IXA). See Pirus Box architecture described    elsewhere in this document.    
LRC: LAN Resource Card. In the Pirus Box described herein, the LRC    interfaces to external LANs, servers or WANS, performs load    balancing and content-aware switching, implements storage mediation    protocols and provides TCP hardware acceleration in accordance with    the present invention.    MAC address: Media Access Control address; a hardware address that    uniquely identifies each node of a network.    Micro-engine: Micro-coded processor in the IXP. In one    implementation of the Pirus Box, there are six in each IXP.    NFS: Network File Server    Protocol Mediation: applications and/or devices that translate    between and among different protocols, such as TCP/IP, X.25, SNMP    and the like. Particular Mediation techniques and systems are    described elsewhere in this document in connection with the Pirus    Box.    RDMA: Remote Direct Memory Access. The transfer of application data    from a remote buffer into a contiguous local buffer. Typically    refers to memory-to-memory copying between processors over TCP    protocols such as HTTP and NFS across an Ethernet.    SCSI: Small Computer System Interface, widely-used ANSI    standards-based family of protocols for communicating with I/O    devices, particularly storage devices.    iSCSI: Internet SCSI, a proposed transport protocol for SCSI that    operates on top of TCP, and transmits native SCSI over a layer of    the IP stack. The Pirus Box described herein provides protocol    mediation services to iSCSI devices and networks (“iSCSI Mediation    Services”), using TCP/IP to provide LAN-attached servers with access    to block-oriented storage.    Silly Window Avoidance Algorithm (Send-Side): A technique in which    the sender delays sending segments until it can accumulate a    reasonable amount of data in its output buffer. In some cases, a    “reasonable amount” is defined to be a maximum-sized segment (MST).    SRC: Storage Resource Card. In the Pirus Box architecture described    herein, the SRC interfaces to external storage devices, provides NFS    and CIFS services, implements IP to Fibre Channel (FC) storage    mediation, provides volume management services (including dynamic    storage partitioning and JBOD (Just a Bunch of Disks) aggregation to    create large storage pools), supports RAID functionality and    provides integrated Fibre Channel SAN switching.    TCP: Transmission Control Protocol, a protocol central to TCP/IP    networks. TCP guarantees delivery of data and that packets will be    delivered in the same order in which they were sent.    TCP/IP: Transmission Control Protocol/Internet Protocol, the suite    of communications protocols used to connect hosts on the Internet.    UDP: User Datagram Protocol (UDP) supports a datagram mode of    packet-switched communications in an interconnected set of computer    networks, and enables applications to message other programs with a    minimum of protocol mechanism. UDP is considerably simpler than TCP    and is useful in situations where the reliability mechanisms of TCP    are not necessary. The UDP header has only four fields: source port,    destination port, length, and UDP checksum.    VxWorks: a real-time operating system, part of the Tornado II    embedded development platform commercially available from WindRiver    Systems, Inc. of Alameda, Calif., which is designed to enable    developers to create complex real-time applications for embedded    microprocessors.

TABLE OF CONTENTS Incorporation by Reference/Priority Claim Field of theInvention Background of the Inventor Summary of the Invention BriefDescription of the Drawings Detailed Description of the Invention I.Overview II. Hardware/Software Architecture 1. Software ArchitectureOverview 1.1. System Services 1.1.1. SanStreaM (SSM) System Services(S2) 1.1.2. SSM Application Service (AS) 2. Management Interface Card2.1. Management Software 2.2. Management Software Overview 2.2.1. UserInterfaces (Uis) 2.2.2. Rapid Control Backplane (RBI) 2.2.3. SystemAbstraction Model (SAM) 2.2.4. Configuration & Statistics Manager (CSM)2.2.5. Logging/Billing (APIs) 2.2.6. Configuration & Statistics Agent(CSA) 2.3. Dynamic Configuration 2.4. Management Applications 2.4.1.Volume Manager 2.4.2. Load Balancer 2.4.3. Server-less Backup (NDMP)2.4.4. IP-ized Storage Management 2.4.5. Mediation Manager 2.4.6. VLANManager 2.4.7. File System Manager 2.5. Virtual Storage Domain (VSD)2.5.1. Services 2.5.2. Policies 2.6. Boot Sequence and Configuration 3.LIC Software 3.1. VLANs 3.1.1. Intelligent Filtering and Forwarding(IFF) 3.2. Load Balance Data Flow 3.3. LIC - NAS Software 3.3.1. VirtualStorage Domains (VSD) 3.3.2. Network Address Translation (NAT) 3.3.3.Local Load Balance (LLB) 3.3.3.1. Load Balancing Order of Operations3.3.3.2. File System Server Load Balance (FSLB) 3.3.3.3. NFS Server LoadBalancing (NLB) 3.3.3.4. TCP and UDP - Methods of Balancing 3.3.3.5.Write Replication 3.3.4. Load Balancer Failure Indication 3.3.4.1. CIFSServer Load Balancing 3.3.4.2. Content Load Balance 3.4. LIC - SCSI/IPSoftware 3.5. Network Processor Functionality 3.5.1. Flow Control3.5.1.1. Flow Definition 3.5.1.2. Flow Control Model 3.5.2. Flow Thru V.Buffering 3.5.2.1. Flow Thru 3.5.2.2. Buffering 4. SRC NAS (SoftwareFeatures) 4.1. SRC NAS Storage Features 4.1.1. Volume Manager 4.1.2.Disk Cache 4.1.3. SCSI 4.1.4. Fibre Channel 4.1.5. Switch FabricInterface 4.2. NAS Pirus System Features 4.2.1. Configuration/Statistics4.2.2. NSF Load Balancing 4.2.3. NFS Mirroring Service 5. SRC Mediation5.1. Supported Mediation Protocols 5.1.1. SCSI/UDP 5.2. StorageComponents 5.2.1. SCSI/IP Layer 5.2.2. SCSI Mediator 5.2.3. VolumeManager 5.2.4. SCSI Originator 5.2.5. SCSI Target 5.2.6. Fibre Channel5.3. Mediation Example III. NFS Load Balancing 1. Operation 1.1. ReadRequests 1.2. Determining the Number of Servers for a File 1.3. ServerLists 1.3.1. Single Server List 1.3.2. Multiple Server Lists 1.4.Synchronizing Lists Across Multiple IXP's IV. Intelligent Forwarding andFiltering 1. Definitions 2. Virtual Domains 2.1. Network AddressTranslation 3. VLAN Definition 3.1. Default VLAN 3.2. ServerAdministration VLAN 3.3. Server Access VLAN 3.4. Port Types 3.4.1.Router Port 3.4.2. Server Port 3.4.3. Combo Port 3.4.4. ServerAdministration Port 3.4.5. Server Access Port 3.4.6. Example of VLAN 4.Filtering Function 5. Forwarding Function 5.1. Flow Entry Description5.1.1. Source IP Address 5.1.2. Destination IP Address 5.1.3.Destination TCP/UDP port 5.1.4. Source physical port 5.1.5. Sourcenext-hop MAC address 5.1.6. Destination physical port 5.1.7. Destinationnext-hop MAC address 5.1.8. NAT IP address 5.1.9. NAT TCP/UDP port5.1.10. Flags 5.1.11. Received pack ts 5.1.12. Transmitted pack ts5.1.13. R ceived bytes 5.1.14. Transmitted bytes 5.1.15. N xt point r(receive path) 5.1.16. Next pointer (transmit path) 5.2. AddingForwarding Entries 5.2.1. Client IP Addresses 5.2.2. Virtual Domain IPAddresses 5.2.3. Server IP Addresses 5.3. Distribute the ForwardingTable 5.4. 
Ingress Function 6. Egress Function V. IP-Based StorageManagement - Device Discovery & Monitoring Examples: Server HealthMediation Target Mediation Initiator NCM VI. DATA STRUCTURE LAYOUT 1.VSD_CFG_T 2. VSE_CFG_T 3. SERVER_CFG_T 4. MED_TARG_CFG_T 5.LUN_MAP_CFG_T 6. FILESYS_CFT_T VII. NAS Mirroring and ContentDistribution 1. Content Distribution and Mirroring 2. MirrorInitialization via NAS Mirror Initialization via NDMP Sparse ContentDistribution NCM NCM Objectives NCM Architecture NCM Processes andLocations NCM and IPC Services NCM and Inode Management Inode AllocationSynchronization Inode Inconsistency Identification 3. Filesystem ServerSets 3.1. Types 3.2. State of the Current Server Set 4. Description ofOperations Create_Current_Filesystem_Server_Set (fsid, slots/cpus)Add_Member_To_Current Filesystem_Server_Set (fsid, slot/cpu) 5.Description of Operations that Change the State of the Current ServerSet Activate_Server_Set (fsid) Pause Filesystem Server Set (fsid)Continue Filesystem Server Set (fsid) Deactivate_Server_Set (fsid) 6.Recovery Operations on a Filesystem Copy Construction RestorationNAS-FS-Copy Construction of Complete Copy Copy Method Special InodesLocking Restoration of Complete Copy Data structuresModified-Inodes-list (IN-Mod) Copy-Inodes-list (IN-Copy) Copy progressCopying Inodes Construction case Restoration Case Examples - Set 1Examples - Set 2 VIII. System Mediation Manager Components StorageHierarchy Functional Specification Functional Requirement Design DataStructures Flow Chart X. Mediation Caching XI. Server Health MonitoringOperation with Network Access Server (NAS) Operation with iSCS/MediationDevices Forwarding of NFS Traffic NFS R ad Load Balancing AlgorithmsXII. Fast-Path: Description of Illustrated Embodiments Fast-PathArchitecture Fast-Path Functions LRC Processing SRC processing Sessioncreation/termination Session Control Blocks IXP Services Further FastPath Aspects ABSTRACT

1. In a digital network including at least first and second ClientServers, each of the first and second Client Servers being operable tocommunicate with (1) respective local clients and (2) a remote DataServer to request access to data files on storage devices connected tothe remote Data Server, the digital network being operable to providemediation between storage and networking protocols used forcommunication between clients, servers and storage devices, a method ofaccelerating read access to data by clients, the method comprising:providing, for each of the first and second Client Servers, a respectivelocal read cache operable to communicate with the respective ClientServer, operable to store a copy of recently read data; providing, foreach of the first and second Client Servers, a respective local writecache operable to communicate with the respective Client Server,operable to store a copy of data to be written; receiving a read accessrequest from one of the local clients in communication with the first orsecond Client Server; in response to receipt of the read access request,checking the respective local write cache for a data segment match; ifno data segment match is found in the respective local write cache,checking the respective local read cache for a data segment match; ifthe segment is found in the respective local read cache, transmitting tothe remote Data Server a request to determine the validity of the datain the respective local read cache, thereby to determine whether thedata in the respective local read cache must be updated from the remoteData Server, if the data in the respective local read cache is notvalid, or if no data segment match is found in the respective local readcache, transmitting the read access request to the remote Data Serverfor serving of the requested data; and once the requested data istransmitted from the remote Data Server, storing a copy of the requesteddata in the respective local read cache.
 2. The method of claim 1,further comprising: assigning a time-stamp to a data segment stored inthe respective local read cache; assigning a time-stamp to a datasegment stored in the remote Data Server; upon receipt of a request todetermine the validity of the data in the respective local read cache,comparing the time-stamp of the data segment stored in the respectivelocal read cache with the time-stamp of a comparable data segment storedin the remote Data Server to determine whether the data segment in therespective local read cache is older than the comparable data segment onthe remote data server; and if the data segment in the respective localread cache is older than the comparable data segment on the remote dataserver, designating as invalid the data segment in the respective localread cache.
 3. The method of claim 2, further comprising: if the datasegment in the respective local read cache is designated invalid, thentransmitting the read access request to the remote Data Server forserving of the requested data; and once the requested data istransmitted from the remote Data Server, storing a copy of the requesteddata in the respective local read cache and updating the respectivetime-stamps.
 4. In a digital network including at least first and secondClient Servers, each of the first and second Client Servers beingoperable to communicate with (1) respective local clients and (2) aremote Data Server to request access to data files on storage devicesconnected to the remote Data Server, the digital network being operableto provide mediation between storage and networking protocols used forcommunication between clients, servers and storage devices, a method ofaccelerating response to write access requests by clients, the methodcomprising: providing, for each of the first and second Client Servers,a respective local read cache operable to communicate with therespective Client Server, operable to store a copy of recently readdata; providing, for each of the first and second Client Servers, arespective local write cache operable to communicate with the respectiveClient Server, operable to store a copy of data to be written; receivinga write request from one of the local clients in communication with thefirst or second Client Server; in response to receipt of the writerequest, checking the respective local read cache for a data segmentmatch, and if a matching data segment is detected, invalidating thematching data segment; checking the respective local write cache for adata segment match, and if a matching write segment is detected,invalidating or reusing the matching write segment; transmitting to theremote Data Server a request to determine whether the write segment isavailable for writing, and if the segment is unavailable, waiting forthe write segment to become available; generating a new write cacheentry representing the write data segments to be written; andtransmitting to the remote Data Server a request to unlock the datasegments to be written.
 5. The method of claim 4 further comprising:determining whether data segments are available for writing by the firstClient Server by checking whether the data segments are being written orotherwise are locked by the second Client Server during the time thefirst Client Server requests write access to the same or overlappingdata segments.
 6. The method of claim 5 wherein the determining furthercomprises generating a lock request on the requested data segment if thesegment is available.
 7. In a switching system adapted to interconnectlocal clients in communication with a Client Server, the Client Serverbeing operable to communicate with a remote Data Server to requestaccess to data files on storage devices connected to the remote DataServer, the switching system being operable to provide mediation betweenstorage and networking protocols used for communication between clients,servers and storage devices, a method of accelerating read access todata by clients, the method comprising: providing a local read cacheoperable to communicate with the Client Server, operable to store a copyof recently read data; providing a local write cache operable tocommunicate with the Client Server, operable to store a copy of data tobe written; storing within the local data cache a copy of data recentlyread from the remote Data Server; receiving a read access request from afirst one of the local clients in communication with the Client Server;in response to receipt of the read access request, checking the localwrite cache for a data segment match; if no data segment match is foundin the local write cache, checking the local read cache for a datasegment match; if no data segment match is found in the local readcache, transmitting the read access request to the remote Data Serverfor serving of the requested data; and once the requested data istransmitted from the remote Data Server, storing a copy of the requesteddata in the local read cache.
 8. In a switching system adapted tointerconnect local clients in communication with a Client Server, theClient Server being operable to communicate with a remote Data Server torequest access to data files on storage devices connected to the remoteData Server, the switching system being operable to provide mediationbetween storage and networking protocols used for communication betweenclients, servers and storage devices, a method of accelerating responseto write access requests by clients, the method comprising: providing alocal read cache operable to communicate with the Client Server,operable to store a copy of recently read data; providing a local writecache operable to communicate with the Client Server, operable to storea copy of data to be written; receiving a write request from a first oneof the local clients in communication with the Client Server, inresponse to receipt of the write request, checking the local read cachefor a matching read segment and, if a matching read segment is detected,invalidating the matching read segment, checking the local write cachefor matching write segments and, if a matching write segment isdetected, invalidating or reusing the write segment, and generating anew local write cache entry representing the write data segment.
 9. In aswitching system including at least first and second Client Servers,each of the first and second Client Servers being operable tocommunicate with (1) respective local clients and (2) a remote DataServer to request access to data files on storage devices connected tothe remote Data Server, the switching system being operable to providemediation between storage and networking protocols used forcommunication between clients, servers and storage devices, a method ofaccelerating read access to data by clients, the method comprising:providing, for each of the first and second Client Servers, a respectivelocal read cache operable to communicate with the respective ClientServer, operable to store a copy of recently read data; providing, foreach of the first and second Client Servers, a respective local writecache operable to communicate with the resoective Client Server,operable to store a copy of data to be written; receiving a read accessrequest from one of the local clients in communication with the first orsecond Client Server; in response to receipt of the read access request,checking the respective local write cache for a data segment match; ifno data segment match is found in the respective local write cache,checking the respective local read cache for a data segment match; ifthe segment is found in the respective local read cache, transmitting tothe remote Data Server a request to determine the validity of the datain the respective local read cache, thereby to determine whether thedata in the respective local read cache must be updated from the remoteData Server, if the data in the respective local read cache is notvalid, or if no data segment match is found in the respective local readcache, transmitting the read access request to the remote Data Serverfor serving of the requested data; and once the requested data istransmitted from the remote Data Server, storing a copy of the requesteddata in the respective local read cache.
 10. The method of claim 9,further comprising: assigning a time-stamp to a data segment stored inthe respective local read cache; assigning a time-stamp to a datasegment stored in the remote Data Server; upon receipt of a request todetermine the validity of the data in the respective local read cache,comparing the time-stamp of the data segment stored in the respectivelocal read cache with the time-stamp of a comparable data segment storedin the remote Data Server to determine whether the data segment in therespective local read cache is older than the comparable data segment onthe remote data server; and if the data segment in the respective localread cache is older than the comparable data segment on the remote dataserver, designating as invalid the data segment in the respective localread cache.
 11. The method of claim 10 further comprising: accumulatingin the respective local write cache multiple segments of data to bewritten; and subsequently transmitting the multiple segments of data tobe written in a batch operation to the remote Data Server.
 12. Themethod of claim 11 wherein the multiple segments of data to be writtenare transmitted in a semi-contiguous write access to the remote DataServer.
 13. The method of claim 10, further comprising: if the datasegment in the respective local read cache is designated invalid, thentransmitting the read access request to the remote Data Server forserving of the requested data; and once the requested data istransmitted from the remote Data Server, storing a copy of the requesteddata in the respective local read cache and updating the respectivetime-stamps.
 14. In a switching system connectable to at least first andsecond Client Servers, each of the first and second Client Servers beingoperable to communicate with (1) respective local clients and (2) aremote Data Server to request access to data files on storage devicesconnected to the remote Data Server, the switching system being operableto provide mediation between storage and networking protocols used forcommunication between clients, servers and storage devices, a method ofaccelerating response to write access requests by clients, the methodcomprising: providing, for each of the first and second Client Servers,a respective local read cache operable to communicate with therespective Client Server, operable to store a copy of recently readdata; providing, for each of the first and second Client Servers, arespective local write cache operable to communicate with the respectiveClient Server, operable to store a copy of data to be written; receivinga write request from one of the local clients in communication with thefirst or second Client Server; in response to receipt of the writerequest, checking the respective local read cache for a data segmentmatch, and if a matching data segment is detected, invalidating thematching data segment; checking the respective local write cache for adata segment match, and if a matching write segment is detected,invalidating or reusing the matching write segment; transmitting to theremote Data Server a request to determine whether the write segment isavailable for writing, and if the segment is unavailable, waiting forthe write segment to become available; generating a new write cacheentry representing the write data segments to be written; andtransmitting to the remote Data Server a request to unlock the datasegments to be written.
15. The method of claim 14 further comprising: determining whether data segments are available for writing by the first Client Server by checking whether the data segments are being written or are locked by the second Client Server during the time the first Client Server requests write access to the same or overlapping data segments.
 16. The method of claim 15 wherein the determining further comprises generating a lock request on the requested data segment if the segment is available.
 17. In a digital network having at least first and secondClient Servers, each of the first and second Client Servers beingoperable to communicate with (1) respective local clients and (2) aremote Data Server to request access to data files on storage devicesconnected to the remote Data Server, the network being operable toprovide mediation between storage and networking protocols used forcommunication between clients, servers and storage devices, a system foraccelerating read access to data by clients, the system comprising:means for providing, for each of the first and second Client Servers, arespective local read cache operable to communicate with the respectiveClient Server, operable to store a copy of recently read data; means forproviding, for each of the first and second Client Servers, a respectivelocal write cache operable to communicate with the respective ClientServer, operable to store a copy of data to be written; means forreceiving a read access request from one of the local clients incommunication with the first or second Client Server; means for, inresponse to receipt of the read access request, checking the respectivelocal write cache for a data segment match; if no data segment match isfound in the respective local write cache, checking the respective localread cache for a data segment match; if the segment is found in therespective local read cache, transmitting to the remote Data Server arequest to determine the validity of the data in the respective localread cache, thereby to determine whether the data in the respectivelocal read cache must be updated from the remote Data Server, if thedata in the respective local read cache is not valid, or if no datasegment match is found in the respective local read cache, transmittingthe read access request to the remote Data Server for serving of therequested data; and means for, once the requested data is transmittedfrom the remote Data Server, storing a copy of the requested data in therespective local read cache.